Internal change
PiperOrigin-RevId: 271275031
Change-Id: I69bce2b27644a3fff7bc445c567c8fab4a8ff234
diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000..baf0444
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,459 @@
+ GNU LESSER GENERAL PUBLIC LICENSE
+ Version 2.1, February 1999
+
+ Copyright (C) 1991, 1999 Free Software Foundation, Inc.
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+[This is the first released version of the Lesser GPL. It also counts
+ as the successor of the GNU Library Public License, version 2, hence
+ the version number 2.1.]
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+Licenses are intended to guarantee your freedom to share and change
+free software--to make sure the software is free for all its users.
+
+ This license, the Lesser General Public License, applies to some
+specially designated software packages--typically libraries--of the
+Free Software Foundation and other authors who decide to use it. You
+can use it too, but we suggest you first think carefully about whether
+this license or the ordinary General Public License is the better
+strategy to use in any particular case, based on the explanations below.
+
+ When we speak of free software, we are referring to freedom of use,
+not price. Our General Public Licenses are designed to make sure that
+you have the freedom to distribute copies of free software (and charge
+for this service if you wish); that you receive source code or can get
+it if you want it; that you can change the software and use pieces of
+it in new free programs; and that you are informed that you can do
+these things.
+
+ To protect your rights, we need to make restrictions that forbid
+distributors to deny you these rights or to ask you to surrender these
+rights. These restrictions translate to certain responsibilities for
+you if you distribute copies of the library or if you modify it.
+
+ For example, if you distribute copies of the library, whether gratis
+or for a fee, you must give the recipients all the rights that we gave
+you. You must make sure that they, too, receive or can get the source
+code. If you link other code with the library, you must provide
+complete object files to the recipients, so that they can relink them
+with the library after making changes to the library and recompiling
+it. And you must show them these terms so they know their rights.
+
+ We protect your rights with a two-step method: (1) we copyright the
+library, and (2) we offer you this license, which gives you legal
+permission to copy, distribute and/or modify the library.
+
+ To protect each distributor, we want to make it very clear that
+there is no warranty for the free library. Also, if the library is
+modified by someone else and passed on, the recipients should know
+that what they have is not the original version, so that the original
+author's reputation will not be affected by problems that might be
+introduced by others.
+
+ Finally, software patents pose a constant threat to the existence of
+any free program. We wish to make sure that a company cannot
+effectively restrict the users of a free program by obtaining a
+restrictive license from a patent holder. Therefore, we insist that
+any patent license obtained for a version of the library must be
+consistent with the full freedom of use specified in this license.
+
+ Most GNU software, including some libraries, is covered by the
+ordinary GNU General Public License. This license, the GNU Lesser
+General Public License, applies to certain designated libraries, and
+is quite different from the ordinary General Public License. We use
+this license for certain libraries in order to permit linking those
+libraries into non-free programs.
+
+ When a program is linked with a library, whether statically or using
+a shared library, the combination of the two is legally speaking a
+combined work, a derivative of the original library. The ordinary
+General Public License therefore permits such linking only if the
+entire combination fits its criteria of freedom. The Lesser General
+Public License permits more lax criteria for linking other code with
+the library.
+
+ We call this license the "Lesser" General Public License because it
+does Less to protect the user's freedom than the ordinary General
+Public License. It also provides other free software developers Less
+of an advantage over competing non-free programs. These disadvantages
+are the reason we use the ordinary General Public License for many
+libraries. However, the Lesser license provides advantages in certain
+special circumstances.
+
+ For example, on rare occasions, there may be a special need to
+encourage the widest possible use of a certain library, so that it becomes
+a de-facto standard. To achieve this, non-free programs must be
+allowed to use the library. A more frequent case is that a free
+library does the same job as widely used non-free libraries. In this
+case, there is little to gain by limiting the free library to free
+software only, so we use the Lesser General Public License.
+
+ In other cases, permission to use a particular library in non-free
+programs enables a greater number of people to use a large body of
+free software. For example, permission to use the GNU C Library in
+non-free programs enables many more people to use the whole GNU
+operating system, as well as its variant, the GNU/Linux operating
+system.
+
+ Although the Lesser General Public License is Less protective of the
+users' freedom, it does ensure that the user of a program that is
+linked with the Library has the freedom and the wherewithal to run
+that program using a modified version of the Library.
+
+ The precise terms and conditions for copying, distribution and
+modification follow. Pay close attention to the difference between a
+"work based on the library" and a "work that uses the library". The
+former contains code derived from the library, whereas the latter must
+be combined with the library in order to run.
+
+ GNU LESSER GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License Agreement applies to any software library or other
+program which contains a notice placed by the copyright holder or
+other authorized party saying it may be distributed under the terms of
+this Lesser General Public License (also called "this License").
+Each licensee is addressed as "you".
+
+ A "library" means a collection of software functions and/or data
+prepared so as to be conveniently linked with application programs
+(which use some of those functions and data) to form executables.
+
+ The "Library", below, refers to any such software library or work
+which has been distributed under these terms. A "work based on the
+Library" means either the Library or any derivative work under
+copyright law: that is to say, a work containing the Library or a
+portion of it, either verbatim or with modifications and/or translated
+straightforwardly into another language. (Hereinafter, translation is
+included without limitation in the term "modification".)
+
+ "Source code" for a work means the preferred form of the work for
+making modifications to it. For a library, complete source code means
+all the source code for all modules it contains, plus any associated
+interface definition files, plus the scripts used to control compilation
+and installation of the library.
+
+ Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running a program using the Library is not restricted, and output from
+such a program is covered only if its contents constitute a work based
+on the Library (independent of the use of the Library in a tool for
+writing it). Whether that is true depends on what the Library does
+and what the program that uses the Library does.
+
+ 1. You may copy and distribute verbatim copies of the Library's
+complete source code as you receive it, in any medium, provided that
+you conspicuously and appropriately publish on each copy an
+appropriate copyright notice and disclaimer of warranty; keep intact
+all the notices that refer to this License and to the absence of any
+warranty; and distribute a copy of this License along with the
+Library.
+
+ You may charge a fee for the physical act of transferring a copy,
+and you may at your option offer warranty protection in exchange for a
+fee.
+
+ 2. You may modify your copy or copies of the Library or any portion
+of it, thus forming a work based on the Library, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) The modified work must itself be a software library.
+
+ b) You must cause the files modified to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ c) You must cause the whole of the work to be licensed at no
+ charge to all third parties under the terms of this License.
+
+ d) If a facility in the modified Library refers to a function or a
+ table of data to be supplied by an application program that uses
+ the facility, other than as an argument passed when the facility
+ is invoked, then you must make a good faith effort to ensure that,
+ in the event an application does not supply such function or
+ table, the facility still operates, and performs whatever part of
+ its purpose remains meaningful.
+
+ (For example, a function in a library to compute square roots has
+ a purpose that is entirely well-defined independent of the
+ application. Therefore, Subsection 2d requires that any
+ application-supplied function or table used by this function must
+ be optional: if the application does not supply it, the square
+ root function must still compute square roots.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Library,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Library, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote
+it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Library.
+
+In addition, mere aggregation of another work not based on the Library
+with the Library (or with a work based on the Library) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may opt to apply the terms of the ordinary GNU General Public
+License instead of this License to a given copy of the Library. To do
+this, you must alter all the notices that refer to this License, so
+that they refer to the ordinary GNU General Public License, version 2,
+instead of to this License. (If a newer version than version 2 of the
+ordinary GNU General Public License has appeared, then you can specify
+that version instead if you wish.) Do not make any other change in
+these notices.
+
+ Once this change is made in a given copy, it is irreversible for
+that copy, so the ordinary GNU General Public License applies to all
+subsequent copies and derivative works made from that copy.
+
+ This option is useful when you wish to copy part of the code of
+the Library into a program that is not a library.
+
+ 4. You may copy and distribute the Library (or a portion or
+derivative of it, under Section 2) in object code or executable form
+under the terms of Sections 1 and 2 above provided that you accompany
+it with the complete corresponding machine-readable source code, which
+must be distributed under the terms of Sections 1 and 2 above on a
+medium customarily used for software interchange.
+
+ If distribution of object code is made by offering access to copy
+from a designated place, then offering equivalent access to copy the
+source code from the same place satisfies the requirement to
+distribute the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 5. A program that contains no derivative of any portion of the
+Library, but is designed to work with the Library by being compiled or
+linked with it, is called a "work that uses the Library". Such a
+work, in isolation, is not a derivative work of the Library, and
+therefore falls outside the scope of this License.
+
+ However, linking a "work that uses the Library" with the Library
+creates an executable that is a derivative of the Library (because it
+contains portions of the Library), rather than a "work that uses the
+library". The executable is therefore covered by this License.
+Section 6 states terms for distribution of such executables.
+
+ When a "work that uses the Library" uses material from a header file
+that is part of the Library, the object code for the work may be a
+derivative work of the Library even though the source code is not.
+Whether this is true is especially significant if the work can be
+linked without the Library, or if the work is itself a library. The
+threshold for this to be true is not precisely defined by law.
+
+ If such an object file uses only numerical parameters, data
+structure layouts and accessors, and small macros and small inline
+functions (ten lines or less in length), then the use of the object
+file is unrestricted, regardless of whether it is legally a derivative
+work. (Executables containing this object code plus portions of the
+Library will still fall under Section 6.)
+
+ Otherwise, if the work is a derivative of the Library, you may
+distribute the object code for the work under the terms of Section 6.
+Any executables containing that work also fall under Section 6,
+whether or not they are linked directly with the Library itself.
+
+ 6. As an exception to the Sections above, you may also combine or
+link a "work that uses the Library" with the Library to produce a
+work containing portions of the Library, and distribute that work
+under terms of your choice, provided that the terms permit
+modification of the work for the customer's own use and reverse
+engineering for debugging such modifications.
+
+ You must give prominent notice with each copy of the work that the
+Library is used in it and that the Library and its use are covered by
+this License. You must supply a copy of this License. If the work
+during execution displays copyright notices, you must include the
+copyright notice for the Library among them, as well as a reference
+directing the user to the copy of this License. Also, you must do one
+of these things:
+
+ a) Accompany the work with the complete corresponding
+ machine-readable source code for the Library including whatever
+ changes were used in the work (which must be distributed under
+ Sections 1 and 2 above); and, if the work is an executable linked
+ with the Library, with the complete machine-readable "work that
+ uses the Library", as object code and/or source code, so that the
+ user can modify the Library and then relink to produce a modified
+ executable containing the modified Library. (It is understood
+ that the user who changes the contents of definitions files in the
+ Library will not necessarily be able to recompile the application
+ to use the modified definitions.)
+
+ b) Use a suitable shared library mechanism for linking with the
+ Library. A suitable mechanism is one that (1) uses at run time a
+ copy of the library already present on the user's computer system,
+ rather than copying library functions into the executable, and (2)
+ will operate properly with a modified version of the library, if
+ the user installs one, as long as the modified version is
+ interface-compatible with the version that the work was made with.
+
+ c) Accompany the work with a written offer, valid for at
+ least three years, to give the same user the materials
+ specified in Subsection 6a, above, for a charge no more
+ than the cost of performing this distribution.
+
+ d) If distribution of the work is made by offering access to copy
+ from a designated place, offer equivalent access to copy the above
+ specified materials from the same place.
+
+ e) Verify that the user has already received a copy of these
+ materials or that you have already sent this user a copy.
+
+ For an executable, the required form of the "work that uses the
+Library" must include any data and utility programs needed for
+reproducing the executable from it. However, as a special exception,
+the materials to be distributed need not include anything that is
+normally distributed (in either source or binary form) with the major
+components (compiler, kernel, and so on) of the operating system on
+which the executable runs, unless that component itself accompanies
+the executable.
+
+ It may happen that this requirement contradicts the license
+restrictions of other proprietary libraries that do not normally
+accompany the operating system. Such a contradiction means you cannot
+use both them and the Library together in an executable that you
+distribute.
+
+ 7. You may place library facilities that are a work based on the
+Library side-by-side in a single library together with other library
+facilities not covered by this License, and distribute such a combined
+library, provided that the separate distribution of the work based on
+the Library and of the other library facilities is otherwise
+permitted, and provided that you do these two things:
+
+ a) Accompany the combined library with a copy of the same work
+ based on the Library, uncombined with any other library
+ facilities. This must be distributed under the terms of the
+ Sections above.
+
+ b) Give prominent notice with the combined library of the fact
+ that part of it is a work based on the Library, and explaining
+ where to find the accompanying uncombined form of the same work.
+
+ 8. You may not copy, modify, sublicense, link with, or distribute
+the Library except as expressly provided under this License. Any
+attempt otherwise to copy, modify, sublicense, link with, or
+distribute the Library is void, and will automatically terminate your
+rights under this License. However, parties who have received copies,
+or rights, from you under this License will not have their licenses
+terminated so long as such parties remain in full compliance.
+
+ 9. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Library or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Library (or any work based on the
+Library), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Library or works based on it.
+
+ 10. Each time you redistribute the Library (or any work based on the
+Library), the recipient automatically receives a license from the
+original licensor to copy, distribute, link with or modify the Library
+subject to these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties with
+this License.
+
+ 11. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Library at all. For example, if a patent
+license would not permit royalty-free redistribution of the Library by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Library.
+
+If any portion of this section is held invalid or unenforceable under any
+particular circumstance, the balance of the section is intended to apply,
+and the section as a whole is intended to apply in other circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 12. If the distribution and/or use of the Library is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Library under this License may add
+an explicit geographical distribution limitation excluding those countries,
+so that distribution is permitted only in or among countries not thus
+excluded. In such case, this License incorporates the limitation as if
+written in the body of this License.
+
+ 13. The Free Software Foundation may publish revised and/or new
+versions of the Lesser General Public License from time to time.
+Such new versions will be similar in spirit to the present version,
+but may differ in detail to address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Library
+specifies a version number of this License which applies to it and
+"any later version", you have the option of following the terms and
+conditions either of that version or of any later version published by
+the Free Software Foundation. If the Library does not specify a
+license version number, you may choose any version ever published by
+the Free Software Foundation.
+
+ 14. If you wish to incorporate parts of the Library into other free
+programs whose distribution conditions are incompatible with these,
+write to the author to ask for permission. For software which is
+copyrighted by the Free Software Foundation, write to the Free
+Software Foundation; we sometimes make exceptions for this. Our
+decision will be guided by the two goals of preserving the free status
+of all derivatives of our free software and of promoting the sharing
+and reuse of software generally.
+
+ NO WARRANTY
+
+ 15. BECAUSE THE LIBRARY IS LICENSED FREE OF CHARGE, THERE IS NO
+WARRANTY FOR THE LIBRARY, TO THE EXTENT PERMITTED BY APPLICABLE LAW.
+EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR
+OTHER PARTIES PROVIDE THE LIBRARY "AS IS" WITHOUT WARRANTY OF ANY
+KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE
+IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE
+LIBRARY IS WITH YOU. SHOULD THE LIBRARY PROVE DEFECTIVE, YOU ASSUME
+THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
+
+ 16. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN
+WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY
+AND/OR REDISTRIBUTE THE LIBRARY AS PERMITTED ABOVE, BE LIABLE TO YOU
+FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR
+CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE
+LIBRARY (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING
+RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A
+FAILURE OF THE LIBRARY TO OPERATE WITH ANY OTHER SOFTWARE), EVEN IF
+SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH
+DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
diff --git a/Makefile.gbase b/Makefile.gbase
new file mode 100644
index 0000000..ad03d36
--- /dev/null
+++ b/Makefile.gbase
@@ -0,0 +1,248 @@
+#
+# Copyright (C) 2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+
+# Makefile for libacml_mv library
+
+# What we're building, and where to find it.
+LIBRARY = libacml_mv.a
+
+TARGETS = $(LIBRARY)
+
+# Makefile setup
+include $(COMMONDEFS)
+
+VPATH = $(BUILD_BASE)/src:$(BUILD_BASE)/src/gas
+
+# Compiler options
+LCOPTS = $(STD_COMPILE_OPTS) $(STD_C_OPTS)
+LCDEFS = $(HOSTDEFS) $(TARGDEFS)
+LCINCS = -I$(BUILD_BASE)/inc
+
+# CFLAGS += -Wall -W -Wstrict-prototypes -Werror -fPIC -O2 $(DEBUG)
+
+ifeq ($(BUILD_ARCH), X8664)
+
+CFILES = \
+ acos.c \
+ acosf.c \
+ acosh.c \
+ acoshf.c \
+ asin.c \
+ asinf.c \
+ asinh.c \
+ asinhf.c \
+ atan2.c \
+ atan2f.c \
+ atan.c \
+ atanf.c \
+ atanh.c \
+ atanhf.c \
+ ceil.c \
+ ceilf.c \
+ cosh.c \
+ coshf.c \
+ exp_special.c \
+ finite.c \
+ finitef.c \
+ floor.c \
+ floorf.c \
+ frexp.c \
+ frexpf.c \
+ hypot.c \
+ hypotf.c \
+ ilogb.c \
+ ilogbf.c \
+ ldexp.c \
+ ldexpf.c \
+ libm_special.c \
+ llrint.c \
+ llrintf.c \
+ llround.c \
+ llroundf.c \
+ log1p.c \
+ log1pf.c \
+ logb.c \
+ logbf.c \
+ log_special.c \
+ lrint.c \
+ lrintf.c \
+ lround.c \
+ lroundf.c \
+ modf.c \
+ modff.c \
+ nan.c \
+ nanf.c \
+ nearbyintf.c \
+ nextafter.c \
+ nextafterf.c \
+ nexttoward.c \
+ nexttowardf.c \
+ pow_special.c \
+ remainder_piby2.c \
+ remainder_piby2d2f.c \
+ rint.c \
+ rintf.c \
+ roundf.c \
+ scalbln.c \
+ scalblnf.c \
+ scalbn.c \
+ scalbnf.c \
+ sincos_special.c \
+ sinh.c \
+ sinhf.c \
+ sqrt.c \
+ sqrtf.c \
+ tan.c \
+ tanf.c \
+ tanh.c \
+ tanhf.c
+
+ASFILES = \
+ cbrtf.S \
+ cbrt.S \
+ copysignf.S \
+ copysign.S \
+ cosf.S \
+ cos.S \
+ exp10f.S \
+ exp10.S \
+ exp2f.S \
+ exp2.S \
+ expf.S \
+ expm1f.S \
+ expm1.S \
+ exp.S \
+ fabsf.S \
+ fabs.S \
+ fdimf.S \
+ fdim.S \
+ fmaxf.S \
+ fmax.S \
+ fminf.S \
+ fmin.S \
+ fmodf.S \
+ fmod.S \
+ log10f.S \
+ log10.S \
+ log2f.S \
+ log2.S \
+ logf.S \
+ log.S \
+ nearbyint.S \
+ powf.S \
+ pow.S \
+ remainderf.S \
+ remainder.S \
+ round.S \
+ sincosf.S \
+ sincos.S \
+ sinf.S \
+ sin.S \
+ truncf.S \
+ trunc.S \
+ v4hcosl.S \
+ v4helpl.S \
+ v4hfrcpal.S \
+ v4hlog10l.S \
+ v4hlog2l.S \
+ v4hlogl.S \
+ v4hsinl.S \
+ vrd2cos.S \
+ vrd2exp.S \
+ vrd2log10.S \
+ vrd2log2.S \
+ vrd2log.S \
+ vrd2sincos.S \
+ vrd2sin.S \
+ vrd4cos.S \
+ vrd4exp.S \
+ vrd4frcpa.S \
+ vrd4log10.S \
+ vrd4log2.S \
+ vrd4log.S \
+ vrd4sin.S \
+ vrdacos.S \
+ vrdaexp.S \
+ vrdalog10.S \
+ vrdalog2.S \
+ vrdalogr.S \
+ vrdalog.S \
+ vrda_scaled_logr.S \
+ vrda_scaledshifted_logr.S \
+ vrdasincos.S \
+ vrdasin.S \
+ vrs4cosf.S \
+ vrs4expf.S \
+ vrs4log10f.S \
+ vrs4log2f.S \
+ vrs4logf.S \
+ vrs4powf.S \
+ vrs4powxf.S \
+ vrs4sincosf.S \
+ vrs4sinf.S \
+ vrs8expf.S \
+ vrs8log10f.S \
+ vrs8log2f.S \
+ vrs8logf.S \
+ vrsacosf.S \
+ vrsaexpf.S \
+ vrsalog10f.S \
+ vrsalog2f.S \
+ vrsalogf.S \
+ vrsapowf.S \
+ vrsapowxf.S \
+ vrsasincosf.S \
+ vrsasinf.S
+
+else
+
+# The special processing of the -lm option in the compiler driver should
+# be delayed until all of the options have been parsed. Until the
+# driver is cleaned up, it is important that processing be the same on
+# all architectures. Thus we add an empty 32-bit ACML vector math
+# library.
+
+dummy.c :
+ echo "void libacml_mv_placeholder() {}" > dummy.c
+
+CFILES = dummy.c
+LDIRT += dummy.c
+
+endif
+
+
+default:
+ $(MAKE) first
+ $(MAKE) $(TARGETS)
+ $(MAKE) last
+
+first :
+ifndef SKIP_DEP_BUILD
+ $(call submake,$(BUILD_AREA)/include)
+endif
+
+last : make_libdeps
+
+include $(COMMONRULES)
+
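+# Archive all compiled objects into the static library and generate its index.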
+$(LIBRARY): $(OBJECTS)
+ $(ar) cru $@ $^
+ $(ranlib) $@
+
diff --git a/acml_trace.cc b/acml_trace.cc
new file mode 100644
index 0000000..b5c967f
--- /dev/null
+++ b/acml_trace.cc
@@ -0,0 +1,86 @@
+// Copyright 2012 Google Inc. All Rights Reserved.
+// Author: martint@google.com (Martin Thuresson)
+
+#include "third_party/open64_libacml_mv/acml_trace.h"
+
+#include <float.h>
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include <algorithm>
+#include <functional>
+#include <string>
+
+#include "base/commandlineflags.h"
+#include "base/examine_stack.h"
+#include "base/googleinit.h"
+#include "base/init_google.h"
+#include "base/logging.h"
+#include "file/base/file.h"
+#include "file/base/helpers.h"
+#include "testing/base/public/benchmark.h"
+#include "testing/base/public/googletest.h"
+#include "third_party/absl/strings/cord.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+#include "util/task/status.h"
+
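+// Reads the trace file `filename` into a vector, invoking `callback` once per
+// record until the CordReader over the file contents is exhausted.  Trace
+// files hold the raw in-memory bytes of each value, so callbacks decode a
+// record with a plain byte copy.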
+template<typename T>
+std::unique_ptr<std::vector<T>> InitTrace(
+ const char* filename,
+ std::function<T(CordReader* reader)> callback) {
+ std::unique_ptr<std::vector<T>> trace(new std::vector<T>);
+ Cord cord;
+ CHECK_OK(file::GetContents(filename, &cord, file::Defaults()));
+ CordReader reader(cord);
+
+ while (!reader.done()) {
+ trace->push_back(callback(&reader));
+ }
+
+ return trace;
+}
+
+// Read a trace file with doubles.
+std::unique_ptr<std::vector<double>> GetTraceDouble(const char *filename) {
+ std::function<double(CordReader* reader)> read_double =
+ [](CordReader* reader) {
+ double d;
+ CHECK_GE(reader->Available(), sizeof(d));
+ reader->ReadN(sizeof(d), reinterpret_cast<char*>(&d));
+ return d;
+ };
+ std::unique_ptr<std::vector<double>> trace(InitTrace<double>(filename,
+ read_double));
+ return trace;
+}
+
+// Read a trace file with pairs of doubles.
+std::unique_ptr<std::vector<std::pair<double, double>>> GetTraceDoublePair(
+ const char *filename) {
+ std::function<std::pair<double, double>(CordReader* reader)> read_double =
+ [](CordReader* reader) {
+ double d[2];
+ CHECK_GE(reader->Available(), sizeof(d));
+ reader->ReadN(sizeof(d), reinterpret_cast<char*>(&d));
+ return std::make_pair(d[0], d[1]);
+ };
+ std::unique_ptr<std::vector<std::pair<double, double>>> trace(
+ InitTrace<std::pair<double, double>>(filename, read_double));
+ return trace;
+}
+
+// Read a trace file with floats.
+std::unique_ptr<std::vector<float>> GetTraceFloat(const char *filename) {
+ std::function<float(CordReader* reader)> read_float =
+ [](CordReader* reader) {
+ float f;
+      const size_t bytes_to_read =
+          std::min<size_t>(sizeof(f), reader->Available());
+ reader->ReadN(bytes_to_read, reinterpret_cast<char*>(&f));
+ return f;
+ };
+ std::unique_ptr<std::vector<float>> trace(InitTrace<float>(filename,
+ read_float));
+ return trace;
+}
diff --git a/acml_trace.h b/acml_trace.h
new file mode 100644
index 0000000..65eda94
--- /dev/null
+++ b/acml_trace.h
@@ -0,0 +1,25 @@
+// Copyright 2012 and onwards Google Inc.
+// Author: martint@google.com (Martin Thuresson)
+
+#ifndef THIRD_PARTY_OPEN64_LIBACML_MV_ACML_TRACE_H__
+#define THIRD_PARTY_OPEN64_LIBACML_MV_ACML_TRACE_H__
+
+// Log files gathered from a complete run of rephil/docs. Contains the
+// arguments to all exp/log/pow calls.
+#define BASE_TRACE_PATH "google3/third_party/open64_libacml_mv/testdata/"
+#define EXP_LOGFILE (BASE_TRACE_PATH "/exp.rephil_docs.builtin.baseline.trace")
+#define EXPF_LOGFILE (BASE_TRACE_PATH "/expf.fastmath_unittest.trace")
+#define LOG_LOGFILE (BASE_TRACE_PATH "/log.rephil_docs.builtin.baseline.trace")
+#define POW_LOGFILE (BASE_TRACE_PATH "/pow.rephil_docs.builtin.baseline.trace")
+
+#include <memory>
+#include <vector>
+
+std::unique_ptr<std::vector<std::pair<double, double>>> GetTraceDoublePair(
+ const char *filename);
+
+std::unique_ptr<std::vector<double>> GetTraceDouble(const char *filename);
+
+std::unique_ptr<std::vector<float>> GetTraceFloat(const char *filename);
+
+#endif // THIRD_PARTY_OPEN64_LIBACML_MV_ACML_TRACE_H__
diff --git a/acml_trace_benchmark.cc b/acml_trace_benchmark.cc
new file mode 100644
index 0000000..fb6acc4
--- /dev/null
+++ b/acml_trace_benchmark.cc
@@ -0,0 +1,272 @@
+// Copyright 2012 Google Inc. All Rights Reserved.
+// Author: martint@google.com (Martin Thuresson)
+
+#include "third_party/open64_libacml_mv/acml_trace.h"
+
+#include <float.h>
+#include <math.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+
+#include <memory>
+#include <vector>
+
+#include "base/commandlineflags.h"
+#include "base/examine_stack.h"
+#include "base/googleinit.h"
+#include "base/init_google.h"
+#include "base/logging.h"
+#include "file/base/file.h"
+#include "file/base/path.h"
+#include "testing/base/public/benchmark.h"
+#include "testing/base/public/googletest.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+
+
+int main(int argc, char** argv) {
+ InitGoogle(argv[0], &argc, &argv, true);
+ RunSpecifiedBenchmarks();
+ return 0;
+}
+
+namespace {
+
+// Local typedefs to avoid repeating complex types all over the function.
+typedef std::unique_ptr<std::vector<double>> DoubleListPtr;
+typedef std::unique_ptr<std::vector<float>> FloatListPtr;
+typedef std::unique_ptr<std::vector<std::pair<double,
+ double>>> DoublePairListPtr;
+
+/////////////////////////
+// Benchmark log() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_log(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ LOG_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ // Process trace.
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += *iter;
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_log().
+static void BM_math_trace_acmllog(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ LOG_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += acml_log(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark log().
+static void BM_math_trace_log(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ LOG_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += log(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+
+/////////////////////////
+// Benchmark exp() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_exp(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ EXP_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += *iter;
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_exp().
+static void BM_math_trace_acmlexp(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ EXP_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += acml_exp(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark exp().
+static void BM_math_trace_exp(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoubleListPtr trace(GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ EXP_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += exp(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+/////////////////////////
+// Benchmark expf() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_expf(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ FloatListPtr trace(GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+ EXPF_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ float d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += *iter;
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_expf().
+static void BM_math_trace_acmlexpf(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ FloatListPtr trace(GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+ EXPF_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ float d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += acml_expf(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark expf().
+static void BM_math_trace_expf(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ FloatListPtr trace(GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+ EXPF_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ float d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d += expf(*iter);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+
+/////////////////////////
+// Benchmark pow() calls.
+/////////////////////////
+
+// Measure time spent iterating through the values.
+static void BM_math_trace_read_pow(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoublePairListPtr trace(GetTraceDoublePair(file::JoinPath(
+ FLAGS_test_srcdir, POW_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+ d += (*itr).first + (*itr).second;
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark acml_pow().
+static void BM_math_trace_acmlpow(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoublePairListPtr trace(GetTraceDoublePair(file::JoinPath(
+ FLAGS_test_srcdir, POW_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+ d += acml_pow((*itr).first,
+ (*itr).second);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+// Benchmark pow().
+static void BM_math_trace_pow(int iters) {
+ // Read trace file into memory.
+ StopBenchmarkTiming();
+ DoublePairListPtr trace(GetTraceDoublePair(file::JoinPath(
+ FLAGS_test_srcdir, POW_LOGFILE).c_str()));
+ StartBenchmarkTiming();
+ double d = 0.0;
+ for (int iter = 0; iter < iters; ++iter) {
+ for (auto itr = trace->begin(); itr != trace->end(); ++itr) {
+ d += pow((*itr).first,
+ (*itr).second);
+ }
+ }
+ CHECK_NE(d, 0.0);
+}
+
+
+BENCHMARK(BM_math_trace_read_exp);
+BENCHMARK(BM_math_trace_acmlexp);
+BENCHMARK(BM_math_trace_exp);
+
+BENCHMARK(BM_math_trace_read_log);
+BENCHMARK(BM_math_trace_acmllog);
+BENCHMARK(BM_math_trace_log);
+
+BENCHMARK(BM_math_trace_read_pow);
+BENCHMARK(BM_math_trace_acmlpow);
+BENCHMARK(BM_math_trace_pow);
+
+BENCHMARK(BM_math_trace_read_expf);
+BENCHMARK(BM_math_trace_acmlexpf);
+BENCHMARK(BM_math_trace_expf);
+
+} // namespace
diff --git a/acml_trace_validate_test.cc b/acml_trace_validate_test.cc
new file mode 100644
index 0000000..9bd682c
--- /dev/null
+++ b/acml_trace_validate_test.cc
@@ -0,0 +1,114 @@
+// Copyright 2012 Google Inc. All Rights Reserved.
+// Author: martint@google.com (Martin Thuresson)
+
+#include "third_party/open64_libacml_mv/acml_trace.h"
+
+#include <math.h>
+#include <stdio.h>
+
+#include <cstdlib>
+#include <memory>
+#include <vector>
+
+#include "base/commandlineflags.h"
+#include "base/examine_stack.h"
+#include "base/googleinit.h"
+#include "base/init_google.h"
+#include "base/logging.h"
+#include "file/base/file.h"
+#include "file/base/path.h"
+#include "testing/base/public/benchmark.h"
+#include "testing/base/public/googletest.h"
+#include "testing/base/public/gunit.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+
+
+int main(int argc, char** argv) {
+ InitGoogle(argv[0], &argc, &argv, true);
+ RunSpecifiedBenchmarks();
+ return RUN_ALL_TESTS();
+}
+
+
+// Compare two doubles given a maximum unit of least precision (ULP).
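+// The bit patterns of the two doubles are reinterpreted as signed 64-bit
+// integers; for finite IEEE-754 values of the same sign, the magnitude of the
+// integer difference equals the number of representable doubles between A and
+// B, so the check passes when the values are at most maxUlps steps apart.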
+bool AlmostEqualDoubleUlps(double A, double B, int64 maxUlps) {
+ CHECK_EQ(sizeof(A), sizeof(maxUlps));
+ if (A == B)
+ return true;
+ int64 intDiff = std::abs(*(reinterpret_cast<int64*>(&A)) -
+ *(reinterpret_cast<int64*>(&B)));
+ return intDiff <= maxUlps;
+}
+
+// Compare two floats given a maximum unit of least precision (ULP).
+bool AlmostEqualFloatUlps(float A, float B, int32 maxUlps) {
+ CHECK_EQ(sizeof(A), sizeof(maxUlps));
+ if (A == B)
+ return true;
+  int32 intDiff = std::abs(*(reinterpret_cast<int32*>(&A)) -
+                           *(reinterpret_cast<int32*>(&B)));
+ return intDiff <= maxUlps;
+}
+
+TEST(Case, LogTest) {
+ // Read trace file into memory.
+ std::unique_ptr<std::vector<double>> trace(
+ GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ LOG_LOGFILE).c_str()));
+ double d1;
+ double d2;
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d1 = acml_log(*iter);
+ d2 = log(*iter);
+ // Make sure difference is at most 1 ULP.
+ EXPECT_TRUE(AlmostEqualDoubleUlps(d1, d2, 1));
+ }
+}
+
+TEST(Case, ExpTest) {
+ // Read trace file into memory.
+ std::unique_ptr<std::vector<double>> trace(
+ GetTraceDouble(file::JoinPath(FLAGS_test_srcdir,
+ EXP_LOGFILE).c_str()));
+ double d1;
+ double d2;
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d1 = acml_exp(*iter);
+ d2 = exp(*iter);
+ // Make sure difference is at most 1 ULP.
+ EXPECT_TRUE(AlmostEqualDoubleUlps(d1, d2, 1));
+ }
+}
+
+
+TEST(Case, ExpfTest) {
+ // Read trace file into memory.
+ std::unique_ptr<std::vector<float>> trace(
+ GetTraceFloat(file::JoinPath(FLAGS_test_srcdir,
+ EXPF_LOGFILE).c_str()));
+ float f1;
+ float f2;
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ f1 = acml_expf(*iter);
+ f2 = expf(*iter);
+ // Make sure difference is at most 1 ULP.
+ EXPECT_TRUE(AlmostEqualFloatUlps(f1, f2, 1));
+ }
+}
+
+
+TEST(Case, PowTest) {
+ // Read trace file into memory.
+ std::unique_ptr<std::vector<std::pair<double, double>>> trace(
+ GetTraceDoublePair(file::JoinPath(FLAGS_test_srcdir,
+ POW_LOGFILE).c_str()));
+ double d1;
+ double d2;
+ for (auto iter = trace->begin(); iter != trace->end(); ++iter) {
+ d1 = acml_pow((*iter).first,
+ (*iter).second);
+ d2 = pow((*iter).first,
+ (*iter).second);
+ // Make sure difference is at most 1 ULP.
+ EXPECT_TRUE(AlmostEqualDoubleUlps(d1, d2, 1));
+ }
+}
diff --git a/inc/acml_mv.h b/inc/acml_mv.h
new file mode 100644
index 0000000..49b7feb
--- /dev/null
+++ b/inc/acml_mv.h
@@ -0,0 +1,81 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+/*
+** A header file defining the C prototypes for the fast/vector libm functions
+*/
+
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+/*
+** The scalar routines.
+*/
+double fastexp(double);
+double fastlog(double);
+double fastlog10(double);
+double fastlog2(double);
+double fastpow(double,double);
+double fastsin(double);
+double fastcos(double);
+void fastsincos(double , double *, double *);
+
+float fastexpf(float );
+float fastlogf(float );
+float fastlog10f(float );
+float fastlog2f(float );
+float fastpowf(float,float);
+float fastcosf(float );
+float fastsinf(float );
+void fastsincosf(float, float *,float *);
+
+
+/*
+** The array routines.
+*/
+void vrda_exp(int, double *, double *);
+void vrda_log(int, double *, double *);
+void vrda_log10(int, double *, double *);
+void vrda_log2(int, double *, double *);
+void vrda_sin(int, double *, double *);
+void vrda_cos(int, double *, double *);
+void vrda_sincos(int, double *, double *, double *);
+
+void vrsa_expf(int, float *, float *);
+void vrsa_logf(int, float *, float *);
+void vrsa_log10f(int, float *, float *);
+void vrsa_log2f(int, float *, float *);
+void vrsa_powf(int n, float *x, float *y, float *z);
+void vrsa_powxf(int n, float *x, float y, float *z);
+void vrsa_sinf(int, float *, float *);
+void vrsa_cosf(int, float *, float *);
+void vrsa_sincosf(int, float *, float *, float *);
+
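+/*
+** Illustrative usage (not part of the library): the array routines take an
+** element count followed by input and output arrays, e.g.
+**
+**   double x[4] = {0.0, 1.0, 2.0, 3.0};
+**   double y[4];
+**   vrda_exp(4, x, y);    (fills y[i] with exp(x[i]))
+*/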
+
+#ifdef __cplusplus
+}
+#endif
diff --git a/inc/acml_mv_m128.h b/inc/acml_mv_m128.h
new file mode 100644
index 0000000..c783fe3
--- /dev/null
+++ b/inc/acml_mv_m128.h
@@ -0,0 +1,103 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+/*
+** A header file defining the C prototypes for the fast/vector libm functions
+*/
+
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+/*
+** The scalar routines.
+*/
+double fastexp(double);
+double fastlog(double);
+double fastlog10(double);
+double fastlog2(double);
+double fastpow(double,double);
+double fastsin(double);
+double fastcos(double);
+void fastsincos(double , double *, double *);
+
+float fastexpf(float );
+float fastlogf(float );
+float fastlog10f(float );
+float fastlog2f(float );
+float fastpowf(float,float);
+float fastcosf(float );
+float fastsinf(float );
+void fastsincosf(float, float *,float *);
+
+/*
+** The single vector routines.
+*/
+__m128d __vrd2_log(__m128d);
+__m128d __vrd2_exp(__m128d);
+__m128d __vrd2_log10(__m128d);
+__m128d __vrd2_log2(__m128d);
+__m128d __vrd2_sin(__m128d);
+__m128d __vrd2_cos(__m128d);
+void __vrd2_sincos(__m128d, __m128d *, __m128d *);
+
+__m128 __vrs4_expf(__m128);
+__m128 __vrs4_logf(__m128);
+__m128 __vrs4_log10f(__m128);
+__m128 __vrs4_log2f(__m128);
+__m128 __vrs4_powf(__m128,__m128);
+__m128 __vrs4_powxf(__m128 x,float y);
+__m128 __vrs4_sinf(__m128);
+__m128 __vrs4_cosf(__m128);
+void __vrs4_sincosf(__m128, __m128 *, __m128 *);
+
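+/*
+** Illustrative usage (not part of the library): the packed-vector entry
+** points operate on one SSE register of values at a time, e.g.
+**
+**   __m128 x = _mm_set_ps(3.0f, 2.0f, 1.0f, 0.0f);
+**   __m128 y = __vrs4_expf(x);    (four packed single-precision exp results)
+*/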
+
+/*
+** The array routines.
+*/
+void vrda_exp(int, double *, double *);
+void vrda_log(int, double *, double *);
+void vrda_log10(int, double *, double *);
+void vrda_log2(int, double *, double *);
+void vrda_sin(int, double *, double *);
+void vrda_cos(int, double *, double *);
+void vrda_sincos(int, double *, double *, double *);
+
+void vrsa_expf(int, float *, float *);
+void vrsa_logf(int, float *, float *);
+void vrsa_log10f(int, float *, float *);
+void vrsa_log2f(int, float *, float *);
+void vrsa_powf(int n, float *x, float *y, float *z);
+void vrsa_powxf(int n, float *x, float y, float *z);
+void vrsa_sinf(int, float *, float *);
+void vrsa_cosf(int, float *, float *);
+void vrsa_sincosf(int, float *, float *, float *);
+
+
+
+#ifdef __cplusplus
+}
+#endif
diff --git a/inc/fn_macros.h b/inc/fn_macros.h
new file mode 100644
index 0000000..afc2f59
--- /dev/null
+++ b/inc/fn_macros.h
@@ -0,0 +1,47 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef __FN_MACROS_H__
+#define __FN_MACROS_H__
+
+#if defined(WINDOWS)
+#pragma warning( disable : 4985 )
+#define FN_PROTOTYPE(fn_name) acml_impl_##fn_name
+#else
+/* For Linux, implementation function names are given a prefix: historically
+   a double underscore, currently acml_impl_. */
+#define ACML_CONCAT(x,y) x##y
+/* #define FN_PROTOTYPE(fn_name) concat(__,fn_name) */
+/* The acml_impl_ prefix is used instead of the commented-out definition above
+   so that the build succeeds.  !!!!! REVISIT THIS SOON !!!!! */
+#define FN_PROTOTYPE(fn_name) ACML_CONCAT(acml_impl_,fn_name)
+#endif
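+/* For example, FN_PROTOTYPE(exp) expands to acml_impl_exp, so each
+   implementation in this library is compiled under an acml_impl_ prefix. */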
+
+
+#if defined(WINDOWS)
+#define weak_alias(name, aliasname) /* as nothing */
+#else
+/* Define ALIASNAME as a weak alias for NAME.
+ If weak aliases are not available, this defines a strong alias. */
+#define weak_alias(name, aliasname) /* _weak_alias (name, aliasname) */ /* !!!!! REVISIT THIS SOON !!!!! */
+#define _weak_alias(name, aliasname) extern __typeof (name) aliasname __attribute__ ((weak, alias (#name)));
+#endif
+
+#endif // __FN_MACROS_H__
diff --git a/inc/libm_amd.h b/inc/libm_amd.h
new file mode 100644
index 0000000..66cd46c
--- /dev/null
+++ b/inc/libm_amd.h
@@ -0,0 +1,225 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_AMD_H_INCLUDED
+#define LIBM_AMD_H_INCLUDED 1
+
+#include <emmintrin.h>
+#include "acml_mv.h"
+#include "acml_mv_m128.h"
+
+#include "fn_macros.h"
+
+#ifdef __cplusplus
+extern "C" {
+#endif
+
+
+ double FN_PROTOTYPE(cbrt)(double x);
+ float FN_PROTOTYPE(cbrtf)(float x);
+
+ double FN_PROTOTYPE(fabs)(double x);
+ float FN_PROTOTYPE(fabsf)(float x);
+
+ double FN_PROTOTYPE(acos)(double x);
+ float FN_PROTOTYPE(acosf)(float x);
+
+ double FN_PROTOTYPE(acosh)(double x);
+ float FN_PROTOTYPE(acoshf)(float x);
+
+ double FN_PROTOTYPE(asin)(double x);
+ float FN_PROTOTYPE(asinf)(float x);
+
+ double FN_PROTOTYPE( asinh)(double x);
+ float FN_PROTOTYPE(asinhf)(float x);
+
+ double FN_PROTOTYPE( atan)(double x);
+ float FN_PROTOTYPE(atanf)(float x);
+
+ double FN_PROTOTYPE( atanh)(double x);
+ float FN_PROTOTYPE(atanhf)(float x);
+
+ double FN_PROTOTYPE( atan2)(double x, double y);
+ float FN_PROTOTYPE(atan2f)(float x, float y);
+
+ double FN_PROTOTYPE( ceil)(double x);
+ float FN_PROTOTYPE(ceilf)(float x);
+
+
+ double FN_PROTOTYPE( cos)(double x);
+ float FN_PROTOTYPE(cosf)(float x);
+
+ double FN_PROTOTYPE( cosh)(double x);
+ float FN_PROTOTYPE(coshf)(float x);
+
+ double FN_PROTOTYPE( exp)(double x);
+ float FN_PROTOTYPE(expf)(float x);
+
+ double FN_PROTOTYPE( expm1)(double x);
+ float FN_PROTOTYPE(expm1f)(float x);
+
+ double FN_PROTOTYPE( exp2)(double x);
+ float FN_PROTOTYPE(exp2f)(float x);
+
+ double FN_PROTOTYPE( exp10)(double x);
+ float FN_PROTOTYPE(exp10f)(float x);
+
+
+ double FN_PROTOTYPE( fdim)(double x, double y);
+ float FN_PROTOTYPE(fdimf)(float x, float y);
+
+#ifdef WINDOWS
+ int FN_PROTOTYPE(finite)(double x);
+ int FN_PROTOTYPE(finitef)(float x);
+#else
+ int FN_PROTOTYPE(finite)(double x);
+ int FN_PROTOTYPE(finitef)(float x);
+#endif
+
+ double FN_PROTOTYPE( floor)(double x);
+ float FN_PROTOTYPE(floorf)(float x);
+
+ double FN_PROTOTYPE( fmax)(double x, double y);
+ float FN_PROTOTYPE(fmaxf)(float x, float y);
+
+ double FN_PROTOTYPE( fmin)(double x, double y);
+ float FN_PROTOTYPE(fminf)(float x, float y);
+
+ double FN_PROTOTYPE( fmod)(double x, double y);
+ float FN_PROTOTYPE(fmodf)(float x, float y);
+
+#ifdef WINDOWS
+ double FN_PROTOTYPE( hypot)(double x, double y);
+ float FN_PROTOTYPE(hypotf)(float x, float y);
+#else
+ double FN_PROTOTYPE( hypot)(double x, double y);
+ float FN_PROTOTYPE(hypotf)(float x, float y);
+#endif
+
+ float FN_PROTOTYPE(ldexpf)(float x, int exp);
+
+ double FN_PROTOTYPE(ldexp)(double x, int exp);
+
+ double FN_PROTOTYPE( log)(double x);
+ float FN_PROTOTYPE(logf)(float x);
+
+
+ float FN_PROTOTYPE(log2f)(float x);
+
+ double FN_PROTOTYPE( log10)(double x);
+ float FN_PROTOTYPE(log10f)(float x);
+
+
+ float FN_PROTOTYPE(log1pf)(float x);
+
+#ifdef WINDOWS
+ double FN_PROTOTYPE( logb)(double x);
+ float FN_PROTOTYPE(logbf)(float x);
+#else
+ double FN_PROTOTYPE( logb)(double x);
+ float FN_PROTOTYPE(logbf)(float x);
+#endif
+
+ double FN_PROTOTYPE( modf)(double x, double *iptr);
+ float FN_PROTOTYPE(modff)(float x, float *iptr);
+
+ double FN_PROTOTYPE( nextafter)(double x, double y);
+ float FN_PROTOTYPE(nextafterf)(float x, float y);
+
+ double FN_PROTOTYPE( pow)(double x, double y);
+ float FN_PROTOTYPE(powf)(float x, float y);
+
+ double FN_PROTOTYPE( remainder)(double x, double y);
+ float FN_PROTOTYPE(remainderf)(float x, float y);
+
+ double FN_PROTOTYPE(sin)(double x);
+ float FN_PROTOTYPE(sinf)(float x);
+
+ void FN_PROTOTYPE(sincos)(double x, double *s, double *c);
+ void FN_PROTOTYPE(sincosf)(float x, float *s, float *c);
+
+ double FN_PROTOTYPE( sinh)(double x);
+ float FN_PROTOTYPE(sinhf)(float x);
+
+ double FN_PROTOTYPE( sqrt)(double x);
+ float FN_PROTOTYPE(sqrtf)(float x);
+
+ double FN_PROTOTYPE( tan)(double x);
+ float FN_PROTOTYPE(tanf)(float x);
+
+ double FN_PROTOTYPE( tanh)(double x);
+ float FN_PROTOTYPE(tanhf)(float x);
+
+ double FN_PROTOTYPE( trunc)(double x);
+ float FN_PROTOTYPE(truncf)(float x);
+
+ double FN_PROTOTYPE( log1p)(double x);
+ double FN_PROTOTYPE( log2)(double x);
+
+ double FN_PROTOTYPE(cosh)(double x);
+ float FN_PROTOTYPE(coshf)(float fx);
+
+ double FN_PROTOTYPE(frexp)(double value, int *exp);
+ float FN_PROTOTYPE(frexpf)(float value, int *exp);
+ int FN_PROTOTYPE(ilogb)(double x);
+ int FN_PROTOTYPE(ilogbf)(float x);
+
+ long long int FN_PROTOTYPE(llrint)(double x);
+ long long int FN_PROTOTYPE(llrintf)(float x);
+ long int FN_PROTOTYPE(lrint)(double x);
+ long int FN_PROTOTYPE(lrintf)(float x);
+ long int FN_PROTOTYPE(lround)(double d);
+ long int FN_PROTOTYPE(lroundf)(float f);
+ double FN_PROTOTYPE(nan)(const char *tagp);
+ float FN_PROTOTYPE(nanf)(const char *tagp);
+ float FN_PROTOTYPE(nearbyintf)(float x);
+ double FN_PROTOTYPE(nearbyint)(double x);
+ double FN_PROTOTYPE(nextafter)(double x, double y);
+ float FN_PROTOTYPE(nextafterf)(float x, float y);
+ double FN_PROTOTYPE(nexttoward)(double x, long double y);
+ float FN_PROTOTYPE(nexttowardf)(float x, long double y);
+ double FN_PROTOTYPE(rint)(double x);
+ float FN_PROTOTYPE(rintf)(float x);
+ float FN_PROTOTYPE(roundf)(float f);
+ double FN_PROTOTYPE(round)(double f);
+ double FN_PROTOTYPE(scalbln)(double x, long int n);
+ float FN_PROTOTYPE(scalblnf)(float x, long int n);
+ double FN_PROTOTYPE(scalbn)(double x, int n);
+ float FN_PROTOTYPE(scalbnf)(float x, int n);
+ long long int FN_PROTOTYPE(llroundf)(float f);
+ long long int FN_PROTOTYPE(llround)(double d);
+
+
+#ifdef WINDOWS
+ double FN_PROTOTYPE(copysign)(double x, double y);
+ float FN_PROTOTYPE(copysignf)(float x, float y);
+#else
+ double FN_PROTOTYPE(copysign)(double x, double y);
+ float FN_PROTOTYPE(copysignf)(float x, float y);
+#endif
+
+#ifdef __cplusplus
+}
+#endif
+
+#endif /* LIBM_AMD_H_INCLUDED */
diff --git a/inc/libm_errno_amd.h b/inc/libm_errno_amd.h
new file mode 100644
index 0000000..1e6b8b9
--- /dev/null
+++ b/inc/libm_errno_amd.h
@@ -0,0 +1,33 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_ERRNO_AMD_H_INCLUDED
+#define LIBM_ERRNO_AMD_H_INCLUDED 1
+
+#include <stdio.h>
+#include <errno.h>
+#ifndef __set_errno
+#define __set_errno(x) errno = (x)
+#endif
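+
+/* Illustrative usage (editorial sketch, not part of the original header):
+   a routine that detects a domain error, e.g. the log of a negative
+   number, might report it as
+
+     __set_errno(EDOM);
+
+   EDOM is the standard domain-error code from <errno.h>. */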
+
+#endif /* LIBM_ERRNO_AMD_H_INCLUDED */
diff --git a/inc/libm_inlines_amd.h b/inc/libm_inlines_amd.h
new file mode 100644
index 0000000..a2e387a
--- /dev/null
+++ b/inc/libm_inlines_amd.h
@@ -0,0 +1,2188 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_INLINES_AMD_H_INCLUDED
+#define LIBM_INLINES_AMD_H_INCLUDED 1
+
+#include "libm_util_amd.h"
+#include <math.h>
+
+#ifdef WINDOWS
+#define inline __inline
+#include "emmintrin.h"
+#endif
+
+/* Compile-time verification that type long long is the same size
+   as type double (i.e. we are really on a 64-bit machine) */
+void check_long_against_double_size(int machine_is_64_bit[(sizeof(long long) == sizeof(double))?1:-1]);
+
+/* Set defines for inline functions calling other inlines */
+#if defined(USE_VAL_WITH_FLAGS) || defined(USE_VALF_WITH_FLAGS) || \
+ defined(USE_ZERO_WITH_FLAGS) || defined(USE_ZEROF_WITH_FLAGS) || \
+ defined(USE_NAN_WITH_FLAGS) || defined(USE_NANF_WITH_FLAGS) || \
+ defined(USE_INDEFINITE_WITH_FLAGS) || defined(USE_INDEFINITEF_WITH_FLAGS) || \
+ defined(USE_INFINITY_WITH_FLAGS) || defined(USE_INFINITYF_WITH_FLAGS) || \
+ defined(USE_SQRT_AMD_INLINE) || defined(USE_SQRTF_AMD_INLINE) || \
+ (defined(WINDOWS) && (defined(USE_HANDLE_ERROR) || defined(USE_HANDLE_ERRORF)))
+#undef USE_RAISE_FPSW_FLAGS
+#define USE_RAISE_FPSW_FLAGS 1
+#endif
+
+#if defined(USE_SPLITDOUBLE)
+/* Splits double x into exponent e and mantissa m, where 0.5 <= abs(m) < 1.0.
+ Assumes that x is not zero, denormal, infinity or NaN, but these conditions
+ are not checked */
+static inline void splitDouble(double x, int *e, double *m)
+{
+ unsigned long long ux, uy;
+ GET_BITS_DP64(x, ux);
+ uy = ux;
+ ux &= EXPBITS_DP64;
+ ux >>= EXPSHIFTBITS_DP64;
+ *e = (int)ux - EXPBIAS_DP64 + 1;
+ uy = (uy & (SIGNBIT_DP64 | MANTBITS_DP64)) | HALFEXPBITS_DP64;
+ PUT_BITS_DP64(uy, x);
+ *m = x;
+}
+#endif /* USE_SPLITDOUBLE */
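+
+/* Worked example (editorial note, not part of the original source):
+   for x = 6.0 the biased exponent field is 1025, so splitDouble sets
+   e = 1025 - 1023 + 1 = 3 and replaces the exponent with that of 0.5,
+   giving m = 0.75; indeed 6.0 == 0.75 * 2**3 with 0.5 <= |m| < 1.0.
+
+     int e; double m;
+     splitDouble(6.0, &e, &m);   // e == 3, m == 0.75
+*/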
+
+
+#if defined(USE_SPLITDOUBLE_2)
+/* Splits double x into exponent e and mantissa m, where 1.0 <= abs(m) < 4.0.
+ Assumes that x is not zero, denormal, infinity or NaN, but these conditions
+   are not checked. Also assumes EXPBIAS_DP64 is odd. With this
+ assumption, e will be even on exit. */
+static inline void splitDouble_2(double x, int *e, double *m)
+{
+ unsigned long long ux, vx;
+ GET_BITS_DP64(x, ux);
+ vx = ux;
+ ux &= EXPBITS_DP64;
+ ux >>= EXPSHIFTBITS_DP64;
+ if (ux & 1)
+ {
+ /* The exponent is odd */
+ vx = (vx & (SIGNBIT_DP64 | MANTBITS_DP64)) | ONEEXPBITS_DP64;
+ PUT_BITS_DP64(vx, x);
+ *m = x;
+ *e = ux - EXPBIAS_DP64;
+ }
+ else
+ {
+ /* The exponent is even */
+ vx = (vx & (SIGNBIT_DP64 | MANTBITS_DP64)) | TWOEXPBITS_DP64;
+ PUT_BITS_DP64(vx, x);
+ *m = x;
+ *e = ux - EXPBIAS_DP64 - 1;
+ }
+}
+#endif /* USE_SPLITDOUBLE_2 */
+
+
+#if defined(USE_SPLITFLOAT)
+/* Splits float x into exponent e and mantissa m, where 0.5 <= abs(m) < 1.0.
+ Assumes that x is not zero, denormal, infinity or NaN, but these conditions
+ are not checked */
+static inline void splitFloat(float x, int *e, float *m)
+{
+ unsigned int ux, uy;
+ GET_BITS_SP32(x, ux);
+ uy = ux;
+ ux &= EXPBITS_SP32;
+ ux >>= EXPSHIFTBITS_SP32;
+ *e = (int)ux - EXPBIAS_SP32 + 1;
+ uy = (uy & (SIGNBIT_SP32 | MANTBITS_SP32)) | HALFEXPBITS_SP32;
+ PUT_BITS_SP32(uy, x);
+ *m = x;
+}
+#endif /* USE_SPLITFLOAT */
+
+
+#if defined(USE_SCALEDOUBLE_1)
+/* Scales the double x by 2.0**n.
+ Assumes EMIN <= n <= EMAX, though this condition is not checked. */
+static inline double scaleDouble_1(double x, int n)
+{
+ double t;
+ /* Construct the number t = 2.0**n */
+ PUT_BITS_DP64(((long long)n + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t);
+ return x*t;
+}
+#endif /* USE_SCALEDOUBLE_1 */
+
+
+#if defined(USE_SCALEDOUBLE_2)
+/* Scales the double x by 2.0**n.
+ Assumes 2*EMIN <= n <= 2*EMAX, though this condition is not checked. */
+static inline double scaleDouble_2(double x, int n)
+{
+ double t1, t2;
+ int n1, n2;
+ n1 = n / 2;
+ n2 = n - n1;
+ /* Construct the numbers t1 = 2.0**n1 and t2 = 2.0**n2 */
+ PUT_BITS_DP64(((long long)n1 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t1);
+ PUT_BITS_DP64(((long long)n2 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t2);
+ return (x*t1)*t2;
+}
+#endif /* USE_SCALEDOUBLE_2 */
+
+
+#if defined(USE_SCALEDOUBLE_3)
+/* Scales the double x by 2.0**n.
+ Assumes 3*EMIN <= n <= 3*EMAX, though this condition is not checked. */
+static inline double scaleDouble_3(double x, int n)
+{
+ double t1, t2, t3;
+ int n1, n2, n3;
+ n1 = n / 3;
+ n2 = (n - n1) / 2;
+ n3 = n - n1 - n2;
+ /* Construct the numbers t1 = 2.0**n1, t2 = 2.0**n2 and t3 = 2.0**n3 */
+ PUT_BITS_DP64(((long long)n1 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t1);
+ PUT_BITS_DP64(((long long)n2 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t2);
+ PUT_BITS_DP64(((long long)n3 + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, t3);
+ return ((x*t1)*t2)*t3;
+}
+#endif /* USE_SCALEDOUBLE_3 */
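+
+/* Editorial note with a worked example (not part of the original source):
+   scaleDouble_1 builds t = 2.0**n directly from its bit pattern, so
+   scaleDouble_1(1.5, 4) returns 24.0 (t = 2**4 = 16).  scaleDouble_2 and
+   scaleDouble_3 split n into two or three parts so that each factor 2**ni
+   stays inside the normal exponent range even when n itself does not, at
+   the cost of one or two extra multiplications. */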
+
+
+#if defined(USE_SCALEFLOAT_1)
+/* Scales the float x by 2.0**n.
+ Assumes EMIN <= n <= EMAX, though this condition is not checked. */
+static inline float scaleFloat_1(float x, int n)
+{
+ float t;
+ /* Construct the number t = 2.0**n */
+ PUT_BITS_SP32((n + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t);
+ return x*t;
+}
+#endif /* USE_SCALEFLOAT_1 */
+
+
+#if defined(USE_SCALEFLOAT_2)
+/* Scales the float x by 2.0**n.
+ Assumes 2*EMIN <= n <= 2*EMAX, though this condition is not checked. */
+static inline float scaleFloat_2(float x, int n)
+{
+ float t1, t2;
+ int n1, n2;
+ n1 = n / 2;
+ n2 = n - n1;
+ /* Construct the numbers t1 = 2.0**n1 and t2 = 2.0**n2 */
+ PUT_BITS_SP32((n1 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t1);
+ PUT_BITS_SP32((n2 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t2);
+ return (x*t1)*t2;
+}
+#endif /* USE_SCALEFLOAT_2 */
+
+
+#if defined(USE_SCALEFLOAT_3)
+/* Scales the float x by 2.0**n.
+ Assumes 3*EMIN <= n <= 3*EMAX, though this condition is not checked. */
+static inline float scaleFloat_3(float x, int n)
+{
+ float t1, t2, t3;
+ int n1, n2, n3;
+ n1 = n / 3;
+ n2 = (n - n1) / 2;
+ n3 = n - n1 - n2;
+ /* Construct the numbers t1 = 2.0**n1, t2 = 2.0**n2 and t3 = 2.0**n3 */
+ PUT_BITS_SP32((n1 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t1);
+ PUT_BITS_SP32((n2 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t2);
+ PUT_BITS_SP32((n3 + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, t3);
+ return ((x*t1)*t2)*t3;
+}
+#endif /* USE_SCALEFLOAT_3 */
+
+#if defined(USE_SETPRECISIONDOUBLE)
+unsigned int setPrecisionDouble(void)
+{
+ unsigned int cw, cwold = 0;
+ /* There is no precision control on Hammer */
+ return cwold;
+}
+#endif /* USE_SETPRECISIONDOUBLE */
+
+#if defined(USE_RESTOREPRECISION)
+void restorePrecision(unsigned int cwold)
+{
+#if defined(WINDOWS)
+ /* There is no precision control on Hammer */
+#elif defined(linux)
+ /* There is no precision control on Hammer */
+#else
+#error Unknown machine
+#endif
+ return;
+}
+#endif /* USE_RESTOREPRECISION */
+
+
+#if defined(USE_CLEAR_FPSW_FLAGS)
+/* Clears floating-point status flags. The argument should be
+   the bitwise OR of the AMD_F_* flags to be cleared, e.g.
+     clear_fpsw_flags(AMD_F_INEXACT | AMD_F_INVALID);
+ */
+static inline void clear_fpsw_flags(int flags)
+{
+#if defined(WINDOWS)
+ unsigned int cw = _mm_getcsr();
+ cw &= (~flags);
+ _mm_setcsr(cw);
+#elif defined(linux)
+ unsigned int cw;
+ /* Get the current floating-point control/status word */
+ asm volatile ("STMXCSR %0" : "=m" (cw));
+ cw &= (~flags);
+ asm volatile ("LDMXCSR %0" : : "m" (cw));
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_CLEAR_FPSW_FLAGS */
+
+
+#if defined(USE_RAISE_FPSW_FLAGS)
+/* Raises floating-point status flags. The argument should be
+   the bitwise OR of the AMD_F_* flags to be raised, e.g.
+     raise_fpsw_flags(AMD_F_INEXACT | AMD_F_INVALID);
+ */
+static inline void raise_fpsw_flags(int flags)
+{
+#if defined(WINDOWS)
+ _mm_setcsr(_mm_getcsr() | flags);
+#elif defined(linux)
+ unsigned int cw;
+ /* Get the current floating-point control/status word */
+ asm volatile ("STMXCSR %0" : "=m" (cw));
+ cw |= flags;
+ asm volatile ("LDMXCSR %0" : : "m" (cw));
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_RAISE_FPSW_FLAGS */
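+
+/* Editorial note (not part of the original source): on both paths above the
+   flags are simply OR-ed into MXCSR, the SSE control/status register, so
+   for example
+
+     raise_fpsw_flags(AMD_F_OVERFLOW | AMD_F_INEXACT);
+
+   marks the overflow and inexact exceptions as having occurred without
+   changing the rounding mode or the exception masks. */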
+
+
+#if defined(USE_GET_FPSW_INLINE)
+/* Return the current floating-point status word */
+static inline unsigned int get_fpsw_inline(void)
+{
+#if defined(WINDOWS)
+ return _mm_getcsr();
+#elif defined(linux)
+ unsigned int sw;
+ asm volatile ("STMXCSR %0" : "=m" (sw));
+ return sw;
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_GET_FPSW_INLINE */
+
+#if defined(USE_SET_FPSW_INLINE)
+/* Set the floating-point status word */
+static inline void set_fpsw_inline(unsigned int sw)
+{
+#if defined(WINDOWS)
+ _mm_setcsr(sw);
+#elif defined(linux)
+ /* Set the current floating-point control/status word */
+ asm volatile ("LDMXCSR %0" : : "m" (sw));
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_SET_FPSW_INLINE */
+
+#if defined(USE_CLEAR_FPSW_INLINE)
+/* Clear all exceptions from the floating-point status word */
+static inline void clear_fpsw_inline(void)
+{
+#if defined(WINDOWS)
+ unsigned int cw;
+ cw = _mm_getcsr();
+ cw &= ~(AMD_F_INEXACT | AMD_F_UNDERFLOW | AMD_F_OVERFLOW |
+ AMD_F_DIVBYZERO | AMD_F_INVALID);
+ _mm_setcsr(cw);
+#elif defined(linux)
+ unsigned int cw;
+ /* Get the current floating-point control/status word */
+ asm volatile ("STMXCSR %0" : "=m" (cw));
+ cw &= ~(AMD_F_INEXACT | AMD_F_UNDERFLOW | AMD_F_OVERFLOW |
+ AMD_F_DIVBYZERO | AMD_F_INVALID);
+ asm volatile ("LDMXCSR %0" : : "m" (cw));
+#else
+#error Unknown machine
+#endif
+}
+#endif /* USE_CLEAR_FPSW_INLINE */
+
+
+#if defined(USE_VAL_WITH_FLAGS)
+/* Returns a double value after raising the given flags,
+ e.g. val_with_flags(x, AMD_F_INEXACT);
+ */
+static inline double val_with_flags(double val, int flags)
+{
+ raise_fpsw_flags(flags);
+ return val;
+}
+#endif /* USE_VAL_WITH_FLAGS */
+
+#if defined(USE_VALF_WITH_FLAGS)
+/* Returns a float value after raising the given flags,
+ e.g. valf_with_flags(x, AMD_F_INEXACT);
+ */
+static inline float valf_with_flags(float val, int flags)
+{
+ raise_fpsw_flags(flags);
+ return val;
+}
+#endif /* USE_VALF_WITH_FLAGS */
+
+
+#if defined(USE_ZERO_WITH_FLAGS)
+/* Returns a double +zero after raising the given flags,
+ e.g. zero_with_flags(AMD_F_INEXACT | AMD_F_INVALID);
+ */
+static inline double zero_with_flags(int flags)
+{
+ raise_fpsw_flags(flags);
+ return 0.0;
+}
+#endif /* USE_ZERO_WITH_FLAGS */
+
+
+#if defined(USE_ZEROF_WITH_FLAGS)
+/* Returns a float +zero after raising the given flags,
+ e.g. zerof_with_flags(AMD_F_INEXACT | AMD_F_INVALID);
+ */
+static inline float zerof_with_flags(int flags)
+{
+ raise_fpsw_flags(flags);
+ return 0.0F;
+}
+#endif /* USE_ZEROF_WITH_FLAGS */
+
+
+#if defined(USE_NAN_WITH_FLAGS)
+/* Returns a double quiet +nan after raising the given flags,
+ e.g. nan_with_flags(AMD_F_INVALID);
+*/
+static inline double nan_with_flags(int flags)
+{
+ double z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_DP64(0x7ff8000000000000, z);
+ return z;
+}
+#endif /* USE_NAN_WITH_FLAGS */
+
+#if defined(USE_NANF_WITH_FLAGS)
+/* Returns a float quiet +nan after raising the given flags,
+ e.g. nanf_with_flags(AMD_F_INVALID);
+*/
+static inline float nanf_with_flags(int flags)
+{
+ float z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_SP32(0x7fc00000, z);
+ return z;
+}
+#endif /* USE_NANF_WITH_FLAGS */
+
+
+#if defined(USE_INDEFINITE_WITH_FLAGS)
+/* Returns a double indefinite after raising the given flags,
+ e.g. indefinite_with_flags(AMD_F_INVALID);
+*/
+static inline double indefinite_with_flags(int flags)
+{
+ double z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_DP64(0xfff8000000000000, z);
+ return z;
+}
+#endif /* USE_INDEFINITE_WITH_FLAGS */
+
+#if defined(USE_INDEFINITEF_WITH_FLAGS)
+/* Returns a float quiet +indefinite after raising the given flags,
+ e.g. indefinitef_with_flags(AMD_F_INVALID);
+*/
+static inline float indefinitef_with_flags(int flags)
+{
+ float z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_SP32(0xffc00000, z);
+ return z;
+}
+#endif /* USE_INDEFINITEF_WITH_FLAGS */
+
+
+#ifdef USE_INFINITY_WITH_FLAGS
+/* Returns a positive double infinity after raising the given flags,
+ e.g. infinity_with_flags(AMD_F_OVERFLOW);
+*/
+static inline double infinity_with_flags(int flags)
+{
+ double z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_DP64((unsigned long long)(BIASEDEMAX_DP64 + 1) << EXPSHIFTBITS_DP64, z);
+ return z;
+}
+#endif /* USE_INFINITY_WITH_FLAGS */
+
+#ifdef USE_INFINITYF_WITH_FLAGS
+/* Returns a positive float infinity after raising the given flags,
+ e.g. infinityf_with_flags(AMD_F_OVERFLOW);
+*/
+static inline float infinityf_with_flags(int flags)
+{
+ float z;
+ raise_fpsw_flags(flags);
+ PUT_BITS_SP32((BIASEDEMAX_SP32 + 1) << EXPSHIFTBITS_SP32, z);
+ return z;
+}
+#endif /* USE_INFINITYF_WITH_FLAGS */
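+
+/* Editorial note (not part of the original source): these helpers are how
+   the inlined routines below signal special results; for instance
+   sqrt_amd_inline returns nan_with_flags(AMD_F_INVALID) for negative
+   arguments, i.e. the default quiet NaN (0x7ff8000000000000) with the
+   invalid-operation flag raised. */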
+
+
+#if defined(USE_SPLITEXP)
+/* Compute the values m, z1, and z2 such that base**x = 2**m * (z1 + z2).
+ Small arguments abs(x) < 1/(16*ln(base)) and extreme arguments
+ abs(x) > large/(ln(base)) (where large is the largest representable
+ floating point number) should be handled separately instead of calling
+ this function. This function is called by exp_amd, exp2_amd, exp10_amd,
+ cosh_amd and sinh_amd. */
+static inline void splitexp(double x, double logbase,
+ double thirtytwo_by_logbaseof2,
+ double logbaseof2_by_32_lead,
+ double logbaseof2_by_32_trail,
+ int *m, double *z1, double *z2)
+{
+ double q, r, r1, r2, f1, f2;
+ int n, j;
+
+/* Arrays two_to_jby32_lead_table and two_to_jby32_trail_table contain
+ leading and trailing parts respectively of precomputed
+ values of pow(2.0,j/32.0), for j = 0, 1, ..., 31.
+ two_to_jby32_lead_table contains the first 25 bits of precision,
+ and two_to_jby32_trail_table contains a further 53 bits precision. */
+
+ static const double two_to_jby32_lead_table[32] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.02189713716506958008e+00, /* 0x3ff059b0d0000000 */
+ 1.04427373409271240234e+00, /* 0x3ff0b55860000000 */
+ 1.06714040040969848633e+00, /* 0x3ff11301d0000000 */
+ 1.09050768613815307617e+00, /* 0x3ff172b830000000 */
+ 1.11438673734664916992e+00, /* 0x3ff1d48730000000 */
+ 1.13878858089447021484e+00, /* 0x3ff2387a60000000 */
+ 1.16372483968734741211e+00, /* 0x3ff29e9df0000000 */
+ 1.18920707702636718750e+00, /* 0x3ff306fe00000000 */
+ 1.21524733304977416992e+00, /* 0x3ff371a730000000 */
+ 1.24185776710510253906e+00, /* 0x3ff3dea640000000 */
+ 1.26905095577239990234e+00, /* 0x3ff44e0860000000 */
+ 1.29683953523635864258e+00, /* 0x3ff4bfdad0000000 */
+ 1.32523661851882934570e+00, /* 0x3ff5342b50000000 */
+ 1.35425549745559692383e+00, /* 0x3ff5ab07d0000000 */
+ 1.38390988111495971680e+00, /* 0x3ff6247eb0000000 */
+ 1.41421353816986083984e+00, /* 0x3ff6a09e60000000 */
+ 1.44518077373504638672e+00, /* 0x3ff71f75e0000000 */
+ 1.47682613134384155273e+00, /* 0x3ff7a11470000000 */
+ 1.50916439294815063477e+00, /* 0x3ff8258990000000 */
+ 1.54221081733703613281e+00, /* 0x3ff8ace540000000 */
+ 1.57598084211349487305e+00, /* 0x3ff93737b0000000 */
+ 1.61049032211303710938e+00, /* 0x3ff9c49180000000 */
+ 1.64575546979904174805e+00, /* 0x3ffa5503b0000000 */
+ 1.68179279565811157227e+00, /* 0x3ffae89f90000000 */
+ 1.71861928701400756836e+00, /* 0x3ffb7f76f0000000 */
+ 1.75625211000442504883e+00, /* 0x3ffc199bd0000000 */
+ 1.79470902681350708008e+00, /* 0x3ffcb720d0000000 */
+ 1.83400803804397583008e+00, /* 0x3ffd5818d0000000 */
+ 1.87416762113571166992e+00, /* 0x3ffdfc9730000000 */
+ 1.91520655155181884766e+00, /* 0x3ffea4afa0000000 */
+ 1.95714408159255981445e+00}; /* 0x3fff507650000000 */
+
+ static const double two_to_jby32_trail_table[32] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.14890470981563546737e-08, /* 0x3e48ac2ba1d73e2a */
+ 4.83347014379782142328e-08, /* 0x3e69f3121ec53172 */
+ 2.67125131841396124714e-10, /* 0x3df25b50a4ebbf1b */
+ 4.65271045830351350190e-08, /* 0x3e68faa2f5b9bef9 */
+ 5.24924336638693782574e-09, /* 0x3e368b9aa7805b80 */
+ 5.38622214388600821910e-08, /* 0x3e6ceac470cd83f6 */
+ 1.90902301017041969782e-08, /* 0x3e547f7b84b09745 */
+ 3.79763538792174980894e-08, /* 0x3e64636e2a5bd1ab */
+ 2.69306947081946450986e-08, /* 0x3e5ceaa72a9c5154 */
+ 4.49683815095311756138e-08, /* 0x3e682468446b6824 */
+ 1.41933332021066904914e-09, /* 0x3e18624b40c4dbd0 */
+ 1.94146510233556266402e-08, /* 0x3e54d8a89c750e5e */
+ 2.46409119489264118569e-08, /* 0x3e5a753e077c2a0f */
+ 4.94812958044698886494e-08, /* 0x3e6a90a852b19260 */
+ 8.48872238075784476136e-10, /* 0x3e0d2ac258f87d03 */
+ 2.42032342089579394887e-08, /* 0x3e59fcef32422cbf */
+ 3.32420002333182569170e-08, /* 0x3e61d8bee7ba46e2 */
+ 1.45956577586525322754e-08, /* 0x3e4f580c36bea881 */
+ 3.46452721050003920866e-08, /* 0x3e62999c25159f11 */
+ 8.07090469079979051284e-09, /* 0x3e415506dadd3e2a */
+ 2.99439161340839520436e-09, /* 0x3e29b8bc9e8a0388 */
+ 9.83621719880452147153e-09, /* 0x3e451f8480e3e236 */
+ 8.35492309647188080486e-09, /* 0x3e41f12ae45a1224 */
+ 3.48493175137966283582e-08, /* 0x3e62b5a75abd0e6a */
+ 1.11084703472699692902e-08, /* 0x3e47daf237553d84 */
+ 5.03688744342840346564e-08, /* 0x3e6b0aa538444196 */
+ 4.81896001063495806249e-08, /* 0x3e69df20d22a0798 */
+ 4.83653666334089557746e-08, /* 0x3e69f7490e4bb40b */
+ 1.29745882314081237628e-08, /* 0x3e4bdcdaf5cb4656 */
+ 9.84532844621636118964e-09, /* 0x3e452486cc2c7b9d */
+ 4.25828404545651943883e-08}; /* 0x3e66dc8a80ce9f09 */
+
+ /*
+ Step 1. Reduce the argument.
+
+ To perform argument reduction, we find the integer n such that
+ x = n * logbaseof2/32 + remainder, |remainder| <= logbaseof2/64.
+ n is defined by round-to-nearest-integer( x*32/logbaseof2 ) and
+ remainder by x - n*logbaseof2/32. The calculation of n is
+ straightforward whereas the computation of x - n*logbaseof2/32
+ must be carried out carefully.
+     logbaseof2/32 is represented in two pieces so that
+ (1) logbaseof2/32 is known to extra precision, (2) the product
+ of n and the leading piece is a model number and is hence
+ calculated without error, and (3) the subtraction of the value
+ obtained in (2) from x is a model number and is hence again
+ obtained without error.
+ */
+
+ r = x * thirtytwo_by_logbaseof2;
+ /* Set n = nearest integer to r */
+ /* This is faster on Hammer */
+ if (r > 0)
+ n = (int)(r + 0.5);
+ else
+ n = (int)(r - 0.5);
+
+ r1 = x - n * logbaseof2_by_32_lead;
+ r2 = - n * logbaseof2_by_32_trail;
+
+ /* Set j = n mod 32: 5 mod 32 = 5, -5 mod 32 = 27, etc. */
+ /* j = n % 32;
+ if (j < 0) j += 32; */
+ j = n & 0x0000001f;
+
+ f1 = two_to_jby32_lead_table[j];
+ f2 = two_to_jby32_trail_table[j];
+
+ *m = (n - j) / 32;
+
+ /* Step 2. The following is the core approximation. We approximate
+ exp(r1+r2)-1 by a polynomial. */
+
+ r1 *= logbase; r2 *= logbase;
+
+ r = r1 + r2;
+ q = r1 + (r2 +
+ r*r*( 5.00000000000000008883e-01 +
+ r*( 1.66666666665260878863e-01 +
+ r*( 4.16666666662260795726e-02 +
+ r*( 8.33336798434219616221e-03 +
+ r*( 1.38889490863777199667e-03 ))))));
+
+ /* Step 3. Function value reconstruction.
+ We now reconstruct the exponential of the input argument
+ so that exp(x) = 2**m * (z1 + z2).
+ The order of the computation below must be strictly observed. */
+
+ *z1 = f1;
+ *z2 = f2 + ((f1 + f2) * q);
+}
+#endif /* USE_SPLITEXP */
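+
+/* Editorial sketch (not part of the original source): for base 2 the
+   reduction constants are exact, so a caller such as exp2 could, in
+   principle, invoke the kernel as
+
+     int m; double z1, z2;
+     splitexp(x,
+              6.93147180559945286e-01,   -- ln(2)
+              32.0,                      -- 32 / log2(2)
+              3.125e-02, 0.0,            -- log2(2)/32 = 1/32, no trailing part
+              &m, &z1, &z2);
+
+   after which exp2(x) = 2**m * (z1 + z2), e.g. via scaleDouble_1.
+   The constants the library's exp2 implementation really uses may be
+   split differently; the values above are only for illustration. */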
+
+
+#if defined(USE_SPLITEXPF)
+/* Compute the values m, z1, and z2 such that base**x = 2**m * (z1 + z2).
+ Small arguments abs(x) < 1/(16*ln(base)) and extreme arguments
+ abs(x) > large/(ln(base)) (where large is the largest representable
+ floating point number) should be handled separately instead of calling
+ this function. This function is called by exp_amd, exp2_amd, exp10_amd,
+ cosh_amd and sinh_amd. */
+static inline void splitexpf(float x, float logbase,
+ float thirtytwo_by_logbaseof2,
+ float logbaseof2_by_32_lead,
+ float logbaseof2_by_32_trail,
+ int *m, float *z1, float *z2)
+{
+ float q, r, r1, r2, f1, f2;
+ int n, j;
+
+/* Arrays two_to_jby32_lead_table and two_to_jby32_trail_table contain
+ leading and trailing parts respectively of precomputed
+ values of pow(2.0,j/32.0), for j = 0, 1, ..., 31.
+ two_to_jby32_lead_table contains the first 10 bits of precision,
+ and two_to_jby32_trail_table contains a further 24 bits precision. */
+
+ static const float two_to_jby32_lead_table[32] = {
+ 1.0000000000E+00F, /* 0x3F800000 */
+ 1.0214843750E+00F, /* 0x3F82C000 */
+ 1.0429687500E+00F, /* 0x3F858000 */
+ 1.0664062500E+00F, /* 0x3F888000 */
+ 1.0898437500E+00F, /* 0x3F8B8000 */
+ 1.1132812500E+00F, /* 0x3F8E8000 */
+ 1.1386718750E+00F, /* 0x3F91C000 */
+ 1.1621093750E+00F, /* 0x3F94C000 */
+ 1.1875000000E+00F, /* 0x3F980000 */
+ 1.2148437500E+00F, /* 0x3F9B8000 */
+ 1.2402343750E+00F, /* 0x3F9EC000 */
+ 1.2675781250E+00F, /* 0x3FA24000 */
+ 1.2949218750E+00F, /* 0x3FA5C000 */
+ 1.3242187500E+00F, /* 0x3FA98000 */
+ 1.3535156250E+00F, /* 0x3FAD4000 */
+ 1.3828125000E+00F, /* 0x3FB10000 */
+ 1.4140625000E+00F, /* 0x3FB50000 */
+ 1.4433593750E+00F, /* 0x3FB8C000 */
+ 1.4765625000E+00F, /* 0x3FBD0000 */
+ 1.5078125000E+00F, /* 0x3FC10000 */
+ 1.5410156250E+00F, /* 0x3FC54000 */
+ 1.5742187500E+00F, /* 0x3FC98000 */
+ 1.6093750000E+00F, /* 0x3FCE0000 */
+ 1.6445312500E+00F, /* 0x3FD28000 */
+ 1.6816406250E+00F, /* 0x3FD74000 */
+ 1.7167968750E+00F, /* 0x3FDBC000 */
+ 1.7558593750E+00F, /* 0x3FE0C000 */
+ 1.7929687500E+00F, /* 0x3FE58000 */
+ 1.8339843750E+00F, /* 0x3FEAC000 */
+ 1.8730468750E+00F, /* 0x3FEFC000 */
+ 1.9140625000E+00F, /* 0x3FF50000 */
+ 1.9570312500E+00F}; /* 0x3FFA8000 */
+
+ static const float two_to_jby32_trail_table[32] = {
+ 0.0000000000E+00F, /* 0x00000000 */
+ 4.1277357377E-04F, /* 0x39D86988 */
+ 1.3050324051E-03F, /* 0x3AAB0D9F */
+ 7.3415064253E-04F, /* 0x3A407404 */
+ 6.6398258787E-04F, /* 0x3A2E0F1E */
+ 1.1054925853E-03F, /* 0x3A90E62D */
+ 1.1675967835E-04F, /* 0x38F4DCE0 */
+ 1.6154836630E-03F, /* 0x3AD3BEA3 */
+ 1.7071149778E-03F, /* 0x3ADFC146 */
+ 4.0360994171E-04F, /* 0x39D39B9C */
+ 1.6234370414E-03F, /* 0x3AD4C982 */
+ 1.4728321694E-03F, /* 0x3AC10C0C */
+ 1.9176795613E-03F, /* 0x3AFB5AA6 */
+ 1.0178930825E-03F, /* 0x3A856AD3 */
+ 7.3992193211E-04F, /* 0x3A41F752 */
+ 1.0973819299E-03F, /* 0x3A8FD607 */
+ 1.5106226783E-04F, /* 0x391E6678 */
+ 1.8214319134E-03F, /* 0x3AEEBD1D */
+ 2.6364589576E-04F, /* 0x398A39F4 */
+ 1.3519275235E-03F, /* 0x3AB13329 */
+ 1.1952003697E-03F, /* 0x3A9CA845 */
+ 1.7620950239E-03F, /* 0x3AE6F619 */
+ 1.1153318919E-03F, /* 0x3A923054 */
+ 1.2242280645E-03F, /* 0x3AA07647 */
+ 1.5220546629E-04F, /* 0x391F9958 */
+ 1.8224230735E-03F, /* 0x3AEEDE5F */
+ 3.9278529584E-04F, /* 0x39CDEEC0 */
+ 1.7403248930E-03F, /* 0x3AE41B9D */
+ 2.3711356334E-05F, /* 0x37C6E7C0 */
+ 1.1207590578E-03F, /* 0x3A92E66F */
+ 1.1440613307E-03F, /* 0x3A95F454 */
+ 1.1287408415E-04F}; /* 0x38ECB6D0 */
+
+ /*
+ Step 1. Reduce the argument.
+
+ To perform argument reduction, we find the integer n such that
+ x = n * logbaseof2/32 + remainder, |remainder| <= logbaseof2/64.
+ n is defined by round-to-nearest-integer( x*32/logbaseof2 ) and
+ remainder by x - n*logbaseof2/32. The calculation of n is
+ straightforward whereas the computation of x - n*logbaseof2/32
+ must be carried out carefully.
+     logbaseof2/32 is represented in two pieces so that
+ (1) logbaseof2/32 is known to extra precision, (2) the product
+ of n and the leading piece is a model number and is hence
+ calculated without error, and (3) the subtraction of the value
+ obtained in (2) from x is a model number and is hence again
+ obtained without error.
+ */
+
+ r = x * thirtytwo_by_logbaseof2;
+ /* Set n = nearest integer to r */
+ /* This is faster on Hammer */
+ if (r > 0)
+ n = (int)(r + 0.5F);
+ else
+ n = (int)(r - 0.5F);
+
+ r1 = x - n * logbaseof2_by_32_lead;
+ r2 = - n * logbaseof2_by_32_trail;
+
+ /* Set j = n mod 32: 5 mod 32 = 5, -5 mod 32 = 27, etc. */
+ /* j = n % 32;
+ if (j < 0) j += 32; */
+ j = n & 0x0000001f;
+
+ f1 = two_to_jby32_lead_table[j];
+ f2 = two_to_jby32_trail_table[j];
+
+ *m = (n - j) / 32;
+
+ /* Step 2. The following is the core approximation. We approximate
+ exp(r1+r2)-1 by a polynomial. */
+
+ r1 *= logbase; r2 *= logbase;
+
+ r = r1 + r2;
+ q = r1 + (r2 +
+ r*r*( 5.00000000000000008883e-01F +
+ r*( 1.66666666665260878863e-01F )));
+
+ /* Step 3. Function value reconstruction.
+ We now reconstruct the exponential of the input argument
+ so that exp(x) = 2**m * (z1 + z2).
+ The order of the computation below must be strictly observed. */
+
+ *z1 = f1;
+ *z2 = f2 + ((f1 + f2) * q);
+}
+#endif /* SPLITEXPF */
+
+
+#if defined(USE_SCALEUPDOUBLE1024)
+/* Scales up a double (normal or denormal) whose bit pattern is given
+ as ux by 2**1024. There are no checks that the input number is
+ scalable by that amount. */
+static inline void scaleUpDouble1024(unsigned long long ux, unsigned long long *ur)
+{
+ unsigned long long uy;
+ double y;
+
+ if ((ux & EXPBITS_DP64) == 0)
+ {
+ /* ux is denormalised */
+    PUT_BITS_DP64(ux | 0x4010000000000000, y);
+    /* Compensate for the implicit bit just added */
+    if (ux & SIGNBIT_DP64)
+ y += 4.0;
+ else
+ y -= 4.0;
+ GET_BITS_DP64(y, uy);
+ }
+ else
+ /* ux is normal */
+ uy = ux + 0x4000000000000000;
+
+ *ur = uy;
+ return;
+}
+
+#endif /* SCALEUPDOUBLE1024 */
+
+
+#if defined(USE_SCALEDOWNDOUBLE)
+/* Scales down a double whose bit pattern is given as ux by 2**k.
+ There are no checks that the input number is scalable by that amount. */
+static inline void scaleDownDouble(unsigned long long ux, int k,
+ unsigned long long *ur)
+{
+ unsigned long long uy, uk, ax, xsign;
+ int n, shift;
+ xsign = ux & SIGNBIT_DP64;
+ ax = ux & ~SIGNBIT_DP64;
+ n = (int)((ax & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - k;
+ if (n > 0)
+ {
+ uk = (unsigned long long)n << EXPSHIFTBITS_DP64;
+ uy = (ax & ~EXPBITS_DP64) | uk;
+ }
+ else
+ {
+ uy = (ax & ~EXPBITS_DP64) | 0x0010000000000000;
+ shift = (1 - n);
+ if (shift > MANTLENGTH_DP64 + 1)
+ /* Sigh. Shifting works mod 64 so be careful not to shift too much */
+ uy = 0;
+ else
+ {
+ /* Make sure we round the result */
+ uy >>= shift - 1;
+ uy = (uy >> 1) + (uy & 1);
+ }
+ }
+ *ur = uy | xsign;
+}
+
+#endif /* SCALEDOWNDOUBLE */
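+
+/* Worked example (editorial note, not part of the original source):
+   for ux = 0x3ff0000000000000 (1.0) and k = 1 the biased exponent
+   1023 - 1 = 1022 is still positive, so the function just rewrites the
+   exponent field and returns 0x3fe0000000000000 (0.5). Only when the
+   result would fall below the normal range does it take the second
+   branch, shifting the mantissa right and rounding. */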
+
+
+#if defined(USE_SCALEUPFLOAT128)
+/* Scales up a float (normal or denormal) whose bit pattern is given
+ as ux by 2**128. There are no checks that the input number is
+ scalable by that amount. */
+static inline void scaleUpFloat128(unsigned int ux, unsigned int *ur)
+{
+ unsigned int uy;
+ float y;
+
+ if ((ux & EXPBITS_SP32) == 0)
+ {
+ /* ux is denormalised */
+ PUT_BITS_SP32(ux | 0x40800000, y);
+ /* Compensate for the implicit bit just added */
+ if (ux & SIGNBIT_SP32)
+ y += 4.0F;
+ else
+ y -= 4.0F;
+ GET_BITS_SP32(y, uy);
+ }
+ else
+ /* ux is normal */
+ uy = ux + 0x40000000;
+ *ur = uy;
+}
+#endif /* SCALEUPFLOAT128 */
+
+
+#if defined(USE_SCALEDOWNFLOAT)
+/* Scales down a float whose bit pattern is given as ux by 2**k.
+ There are no checks that the input number is scalable by that amount. */
+static inline void scaleDownFloat(unsigned int ux, int k,
+ unsigned int *ur)
+{
+ unsigned int uy, uk, ax, xsign;
+ int n, shift;
+
+ xsign = ux & SIGNBIT_SP32;
+ ax = ux & ~SIGNBIT_SP32;
+ n = ((ax & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - k;
+ if (n > 0)
+ {
+ uk = (unsigned int)n << EXPSHIFTBITS_SP32;
+ uy = (ax & ~EXPBITS_SP32) | uk;
+ }
+ else
+ {
+ uy = (ax & ~EXPBITS_SP32) | 0x00800000;
+ shift = (1 - n);
+ if (shift > MANTLENGTH_SP32 + 1)
+ /* Sigh. Shifting works mod 32 so be careful not to shift too much */
+ uy = 0;
+ else
+ {
+ /* Make sure we round the result */
+ uy >>= shift - 1;
+ uy = (uy >> 1) + (uy & 1);
+ }
+ }
+ *ur = uy | xsign;
+}
+#endif /* SCALEDOWNFLOAT */
+
+
+#if defined(USE_SQRT_AMD_INLINE)
+static inline double sqrt_amd_inline(double x)
+{
+ /*
+ Computes the square root of x.
+
+ The calculation is carried out in three steps.
+
+ Step 1. Reduction.
+ The input argument is scaled to the interval [1, 4) by
+ computing
+ x = 2^e * y, where y in [1,4).
+ Furthermore y is decomposed as y = c + t where
+ c = 1 + j/32, j = 0,1,..,96; and |t| <= 1/64.
+
+ Step 2. Approximation.
+ An approximation q = sqrt(1 + (t/c)) - 1 is obtained
+ from a basic series expansion using precomputed values
+ stored in rt_jby32_lead_table_dbl and rt_jby32_trail_table_dbl.
+
+ Step 3. Reconstruction.
+ The value of sqrt(x) is reconstructed via
+ sqrt(x) = 2^(e/2) * sqrt(y)
+ = 2^(e/2) * sqrt(c) * sqrt(y/c)
+ = 2^(e/2) * sqrt(c) * sqrt(1 + t/c)
+ = 2^(e/2) * [ sqrt(c) + sqrt(c)*q ]
+ */
+
+ unsigned long long ux, ax, u;
+ double r1, r2, c, y, p, q, r, twop, z, rtc, rtc_lead, rtc_trail;
+ int e, denorm = 0, index;
+
+/* Arrays rt_jby32_lead_table_dbl and rt_jby32_trail_table_dbl contain
+ leading and trailing parts respectively of precomputed
+ values of sqrt(j/32), for j = 32, 33, ..., 128.
+ rt_jby32_lead_table_dbl contains the first 21 bits of precision,
+ and rt_jby32_trail_table_dbl contains a further 53 bits precision. */
+
+ static const double rt_jby32_lead_table_dbl[97] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.01550388336181640625e+00, /* 0x3ff03f8100000000 */
+ 1.03077602386474609375e+00, /* 0x3ff07e0f00000000 */
+ 1.04582500457763671875e+00, /* 0x3ff0bbb300000000 */
+ 1.06065940856933593750e+00, /* 0x3ff0f87600000000 */
+ 1.07528972625732421875e+00, /* 0x3ff1346300000000 */
+ 1.08972454071044921875e+00, /* 0x3ff16f8300000000 */
+ 1.10396957397460937500e+00, /* 0x3ff1a9dc00000000 */
+ 1.11803340911865234375e+00, /* 0x3ff1e37700000000 */
+ 1.13192272186279296875e+00, /* 0x3ff21c5b00000000 */
+ 1.14564323425292968750e+00, /* 0x3ff2548e00000000 */
+ 1.15920162200927734375e+00, /* 0x3ff28c1700000000 */
+ 1.17260360717773437500e+00, /* 0x3ff2c2fc00000000 */
+ 1.18585395812988281250e+00, /* 0x3ff2f94200000000 */
+ 1.19895744323730468750e+00, /* 0x3ff32eee00000000 */
+ 1.21191978454589843750e+00, /* 0x3ff3640600000000 */
+ 1.22474479675292968750e+00, /* 0x3ff3988e00000000 */
+ 1.23743629455566406250e+00, /* 0x3ff3cc8a00000000 */
+ 1.25000000000000000000e+00, /* 0x3ff4000000000000 */
+ 1.26243782043457031250e+00, /* 0x3ff432f200000000 */
+ 1.27475452423095703125e+00, /* 0x3ff4656500000000 */
+ 1.28695297241210937500e+00, /* 0x3ff4975c00000000 */
+ 1.29903793334960937500e+00, /* 0x3ff4c8dc00000000 */
+ 1.31101036071777343750e+00, /* 0x3ff4f9e600000000 */
+ 1.32287502288818359375e+00, /* 0x3ff52a7f00000000 */
+ 1.33463478088378906250e+00, /* 0x3ff55aaa00000000 */
+ 1.34629058837890625000e+00, /* 0x3ff58a6800000000 */
+ 1.35784721374511718750e+00, /* 0x3ff5b9be00000000 */
+ 1.36930561065673828125e+00, /* 0x3ff5e8ad00000000 */
+ 1.38066959381103515625e+00, /* 0x3ff6173900000000 */
+ 1.39194107055664062500e+00, /* 0x3ff6456400000000 */
+ 1.40312099456787109375e+00, /* 0x3ff6732f00000000 */
+ 1.41421318054199218750e+00, /* 0x3ff6a09e00000000 */
+ 1.42521858215332031250e+00, /* 0x3ff6cdb200000000 */
+ 1.43614006042480468750e+00, /* 0x3ff6fa6e00000000 */
+ 1.44697952270507812500e+00, /* 0x3ff726d400000000 */
+ 1.45773792266845703125e+00, /* 0x3ff752e500000000 */
+ 1.46841716766357421875e+00, /* 0x3ff77ea300000000 */
+ 1.47901916503906250000e+00, /* 0x3ff7aa1000000000 */
+ 1.48954677581787109375e+00, /* 0x3ff7d52f00000000 */
+ 1.50000000000000000000e+00, /* 0x3ff8000000000000 */
+ 1.51038074493408203125e+00, /* 0x3ff82a8500000000 */
+ 1.52068996429443359375e+00, /* 0x3ff854bf00000000 */
+ 1.53093051910400390625e+00, /* 0x3ff87eb100000000 */
+ 1.54110336303710937500e+00, /* 0x3ff8a85c00000000 */
+ 1.55120849609375000000e+00, /* 0x3ff8d1c000000000 */
+ 1.56124877929687500000e+00, /* 0x3ff8fae000000000 */
+ 1.57122516632080078125e+00, /* 0x3ff923bd00000000 */
+ 1.58113861083984375000e+00, /* 0x3ff94c5800000000 */
+ 1.59099006652832031250e+00, /* 0x3ff974b200000000 */
+ 1.60078048706054687500e+00, /* 0x3ff99ccc00000000 */
+ 1.61051177978515625000e+00, /* 0x3ff9c4a800000000 */
+ 1.62018489837646484375e+00, /* 0x3ff9ec4700000000 */
+ 1.62979984283447265625e+00, /* 0x3ffa13a900000000 */
+ 1.63935947418212890625e+00, /* 0x3ffa3ad100000000 */
+ 1.64886283874511718750e+00, /* 0x3ffa61be00000000 */
+ 1.65831184387207031250e+00, /* 0x3ffa887200000000 */
+ 1.66770744323730468750e+00, /* 0x3ffaaeee00000000 */
+ 1.67705059051513671875e+00, /* 0x3ffad53300000000 */
+ 1.68634128570556640625e+00, /* 0x3ffafb4100000000 */
+ 1.69558238983154296875e+00, /* 0x3ffb211b00000000 */
+ 1.70477199554443359375e+00, /* 0x3ffb46bf00000000 */
+ 1.71391296386718750000e+00, /* 0x3ffb6c3000000000 */
+ 1.72300529479980468750e+00, /* 0x3ffb916e00000000 */
+ 1.73204994201660156250e+00, /* 0x3ffbb67a00000000 */
+ 1.74104785919189453125e+00, /* 0x3ffbdb5500000000 */
+ 1.75000000000000000000e+00, /* 0x3ffc000000000000 */
+ 1.75890541076660156250e+00, /* 0x3ffc247a00000000 */
+ 1.76776695251464843750e+00, /* 0x3ffc48c600000000 */
+ 1.77658367156982421875e+00, /* 0x3ffc6ce300000000 */
+ 1.78535652160644531250e+00, /* 0x3ffc90d200000000 */
+ 1.79408740997314453125e+00, /* 0x3ffcb49500000000 */
+ 1.80277538299560546875e+00, /* 0x3ffcd82b00000000 */
+ 1.81142139434814453125e+00, /* 0x3ffcfb9500000000 */
+ 1.82002735137939453125e+00, /* 0x3ffd1ed500000000 */
+ 1.82859230041503906250e+00, /* 0x3ffd41ea00000000 */
+ 1.83711719512939453125e+00, /* 0x3ffd64d500000000 */
+ 1.84560203552246093750e+00, /* 0x3ffd879600000000 */
+ 1.85404872894287109375e+00, /* 0x3ffdaa2f00000000 */
+ 1.86245727539062500000e+00, /* 0x3ffdcca000000000 */
+ 1.87082862854003906250e+00, /* 0x3ffdeeea00000000 */
+ 1.87916183471679687500e+00, /* 0x3ffe110c00000000 */
+ 1.88745784759521484375e+00, /* 0x3ffe330700000000 */
+ 1.89571857452392578125e+00, /* 0x3ffe54dd00000000 */
+ 1.90394306182861328125e+00, /* 0x3ffe768d00000000 */
+ 1.91213226318359375000e+00, /* 0x3ffe981800000000 */
+ 1.92028617858886718750e+00, /* 0x3ffeb97e00000000 */
+ 1.92840576171875000000e+00, /* 0x3ffedac000000000 */
+ 1.93649101257324218750e+00, /* 0x3ffefbde00000000 */
+ 1.94454288482666015625e+00, /* 0x3fff1cd900000000 */
+ 1.95256233215332031250e+00, /* 0x3fff3db200000000 */
+ 1.96054744720458984375e+00, /* 0x3fff5e6700000000 */
+ 1.96850109100341796875e+00, /* 0x3fff7efb00000000 */
+ 1.97642326354980468750e+00, /* 0x3fff9f6e00000000 */
+ 1.98431301116943359375e+00, /* 0x3fffbfbf00000000 */
+ 1.99217128753662109375e+00, /* 0x3fffdfef00000000 */
+ 2.00000000000000000000e+00}; /* 0x4000000000000000 */
+
+ static const double rt_jby32_trail_table_dbl[97] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 9.17217678638807524014e-07, /* 0x3eaec6d70177881c */
+ 3.82539669043705364790e-07, /* 0x3e99abfb41bd6b24 */
+ 2.85899577162227138140e-08, /* 0x3e5eb2bf6bab55a2 */
+ 7.63210485349101216659e-07, /* 0x3ea99bed9b2d8d0c */
+ 9.32123004127716212874e-07, /* 0x3eaf46e029c1b296 */
+ 1.95174719169309219157e-07, /* 0x3e8a3226fc42f30c */
+ 5.34316371481845492427e-07, /* 0x3ea1edbe20701d73 */
+ 5.79631242504454563052e-07, /* 0x3ea372fe94f82be7 */
+ 4.20404384109571705948e-07, /* 0x3e9c367e08e7bb06 */
+ 6.89486030314147010716e-07, /* 0x3ea722a3d0a66608 */
+ 6.89927685625314560328e-07, /* 0x3ea7266f067ca1d6 */
+ 3.32778123013641425828e-07, /* 0x3e965515a9b34850 */
+ 1.64433259436999584387e-07, /* 0x3e8611e23ef6c1bd */
+ 4.37590875197899335723e-07, /* 0x3e9d5dc1059ed8e7 */
+ 1.79808183816018617413e-07, /* 0x3e88222982d0e4f4 */
+ 7.46386593615986477624e-08, /* 0x3e7409212e7d0322 */
+ 5.72520794105201454728e-07, /* 0x3ea335ea8a5fcf39 */
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 2.96860689431670420344e-07, /* 0x3e93ec071e938bfe */
+ 3.54167239176257065345e-07, /* 0x3e97c48bfd9862c6 */
+ 7.95211265664474710063e-07, /* 0x3eaaaed010f74671 */
+ 1.72327048595145565621e-07, /* 0x3e87211cbfeb62e0 */
+ 6.99494915996239297020e-07, /* 0x3ea7789d9660e72d */
+ 6.32644111701500844315e-07, /* 0x3ea53a5f1d36f1cf */
+ 6.20124838851440463844e-10, /* 0x3e054eacff2057dc */
+ 6.13404719757812629969e-07, /* 0x3ea4951b3e6a83cc */
+ 3.47654909777986407387e-07, /* 0x3e9754aa76884c66 */
+ 7.83106177002392475763e-07, /* 0x3eaa46d4b1de1074 */
+ 5.33337372440526357008e-07, /* 0x3ea1e55548f92635 */
+ 2.01508648555298681765e-08, /* 0x3e55a3070dd17788 */
+ 5.25472356925843939587e-07, /* 0x3ea1a1c5eedb0801 */
+ 3.81831102861301692797e-07, /* 0x3e999fcef32422cc */
+ 6.99220602161420018738e-07, /* 0x3ea776425d6b0199 */
+ 6.01209702477462624811e-07, /* 0x3ea42c5a1e0191a2 */
+ 9.01437000591944740554e-08, /* 0x3e7832a0bdff1327 */
+ 5.10428680864685379950e-08, /* 0x3e6b674743636676 */
+ 3.47895267104621031421e-07, /* 0x3e9758cb90d2f714 */
+ 7.80735841510641848628e-07, /* 0x3eaa3278459cde25 */
+ 1.35158752025506517690e-07, /* 0x3e822404f4a103ee */
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.76523947728535489812e-09, /* 0x3e1e539af6892ac5 */
+ 6.68280121328499932183e-07, /* 0x3ea66c7b872c9cd0 */
+ 5.70135482405123276616e-07, /* 0x3ea3216d2f43887d */
+ 1.37705134737562525897e-07, /* 0x3e827b832cbedc0e */
+ 7.09655107074516613672e-07, /* 0x3ea7cfe41579091d */
+ 7.20302724551461693011e-07, /* 0x3ea82b5a713c490a */
+ 4.69926266058212796694e-07, /* 0x3e9f8945932d872e */
+ 2.19244345915999437026e-07, /* 0x3e8d6d2da9490251 */
+ 1.91141411617401877927e-07, /* 0x3e89a791a3114e4a */
+ 5.72297665296622053774e-07, /* 0x3ea333ffe005988d */
+ 5.61055484436830560103e-07, /* 0x3ea2d36e0ed49ab1 */
+ 2.76225500213991506100e-07, /* 0x3e92898498f55f9e */
+ 7.58466189522395692908e-07, /* 0x3ea9732cca1032a3 */
+ 1.56893371256836029827e-07, /* 0x3e850ed0b02a22d2 */
+ 4.06038997708867066507e-07, /* 0x3e9b3fb265b1e40a */
+ 5.51305629612057435809e-07, /* 0x3ea27fade682d1de */
+ 5.64778487026561123207e-07, /* 0x3ea2f36906f707ba */
+ 3.92609705553556897517e-07, /* 0x3e9a58fbbee883b6 */
+ 9.09698438776943827802e-07, /* 0x3eae864005bca6d7 */
+ 1.05949774066016139743e-07, /* 0x3e7c70d02300f263 */
+ 7.16578798392844784244e-07, /* 0x3ea80b5d712d8e3e */
+ 6.86233073531233972561e-07, /* 0x3ea706b27cc7d390 */
+ 7.99211473033494452908e-07, /* 0x3eaad12c9d849a97 */
+ 8.65552275731027456121e-07, /* 0x3ead0b09954e764b */
+ 6.75456120386058448618e-07, /* 0x3ea6aa1fb7826cbd */
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 4.99167184520462138743e-07, /* 0x3ea0bfd03f46763c */
+ 4.51720373502110930296e-10, /* 0x3dff0abfb4adfb9e */
+ 1.28874162718371367439e-07, /* 0x3e814c151f991b2e */
+ 5.85529267186999798656e-07, /* 0x3ea3a5a879b09292 */
+ 1.01827770937125531924e-07, /* 0x3e7b558d173f9796 */
+ 2.54736389177809626508e-07, /* 0x3e9118567cd83fb8 */
+ 6.98925535290464831294e-07, /* 0x3ea773b981896751 */
+ 1.20940735036524314513e-07, /* 0x3e803b7df49f48a8 */
+ 5.43759351196479689657e-08, /* 0x3e6d315f22491900 */
+ 1.11957989042397958409e-07, /* 0x3e7e0db1c5bb84b2 */
+ 8.47006714134442661218e-07, /* 0x3eac6bbb7644ff76 */
+ 8.92831044643427836228e-07, /* 0x3eadf55c3afec01f */
+ 7.77828292464916501663e-07, /* 0x3eaa197e81034da3 */
+ 6.48469316302918797451e-08, /* 0x3e71683f4920555d */
+ 2.12579816658859849140e-07, /* 0x3e8c882fd78bb0b0 */
+ 7.61222472580559138435e-07, /* 0x3ea98ad9eb7b83ec */
+ 2.86488961857314189607e-07, /* 0x3e9339d7c7777273 */
+ 2.14637363790165363515e-07, /* 0x3e8ccee237cae6fe */
+ 5.44137005612605847831e-08, /* 0x3e6d368fe324a146 */
+ 2.58378284856442408413e-07, /* 0x3e9156e7b6d99b45 */
+ 3.15848939061134843091e-07, /* 0x3e95323e5310b5c1 */
+ 6.60530466255089632309e-07, /* 0x3ea629e9db362f5d */
+ 7.63436345535852301127e-07, /* 0x3ea99dde4728d7ec */
+ 8.68233432860324345268e-08, /* 0x3e774e746878544d */
+ 9.45465175398023087082e-07, /* 0x3eafb97be873a87d */
+ 8.77499534786171267246e-07, /* 0x3ead71a9e23c2f63 */
+ 2.74055432394999316135e-07, /* 0x3e92643c89cda173 */
+ 4.72129009349126213532e-07, /* 0x3e9faf1d57a4d56c */
+ 8.93777032327078947306e-07, /* 0x3eadfd7c7ab7b282 */
+ 0.00000000000000000000e+00}; /* 0x0000000000000000 */
+
+
+ /* Handle special arguments first */
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & (~SIGNBIT_DP64);
+
+ if(ax >= 0x7ff0000000000000)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ /* x is NaN */
+ return x + x; /* Raise invalid if it is a signalling NaN */
+ else if (ux & SIGNBIT_DP64)
+ /* x is negative infinity */
+ return nan_with_flags(AMD_F_INVALID);
+ else
+ /* x is positive infinity */
+ return x;
+ }
+ else if (ux & SIGNBIT_DP64)
+ {
+ /* x is negative. */
+ if (ux == SIGNBIT_DP64)
+ /* Handle negative zero first */
+ return x;
+ else
+ return nan_with_flags(AMD_F_INVALID);
+ }
+ else if (ux <= 0x000fffffffffffff)
+ {
+ /* x is denormalised or zero */
+ if (ux == 0)
+ /* x is zero */
+ return x;
+ else
+ {
+ /* x is denormalised; scale it up */
+ /* Normalize x by increasing the exponent by 60
+ and subtracting a correction to account for the implicit
+ bit. This replaces a slow denormalized
+ multiplication by a fast normal subtraction. */
+ static const double corr = 2.5653355008114851558350183e-290; /* 0x03d0000000000000 */
+ denorm = 1;
+ GET_BITS_DP64(x, ux);
+ PUT_BITS_DP64(ux | 0x03d0000000000000, x);
+ x -= corr;
+ GET_BITS_DP64(x, ux);
+ }
+ }
+
+ /* Main algorithm */
+
+ /*
+ Find y and e such that x = 2^e * y, where y in [1,4).
+ This is done using an in-lined variant of splitDouble,
+ which also ensures that e is even.
+ */
+ y = x;
+ ux &= EXPBITS_DP64;
+ ux >>= EXPSHIFTBITS_DP64;
+ if (ux & 1)
+ {
+ GET_BITS_DP64(y, u);
+ u &= (SIGNBIT_DP64 | MANTBITS_DP64);
+ u |= ONEEXPBITS_DP64;
+ PUT_BITS_DP64(u, y);
+ e = ux - EXPBIAS_DP64;
+ }
+ else
+ {
+ GET_BITS_DP64(y, u);
+ u &= (SIGNBIT_DP64 | MANTBITS_DP64);
+ u |= TWOEXPBITS_DP64;
+ PUT_BITS_DP64(u, y);
+ e = ux - EXPBIAS_DP64 - 1;
+ }
+
+
+ /* Find the index of the sub-interval of [1,4) in which y lies. */
+
+ index = (int)(32.0*y+0.5);
+
+  /* Look up the table values and compute c and r = t/c */
+
+ rtc_lead = rt_jby32_lead_table_dbl[index-32];
+ rtc_trail = rt_jby32_trail_table_dbl[index-32];
+ c = 0.03125*index;
+ r = (y - c)/c;
+
+ /*
+ Find q = sqrt(1+r) - 1.
+ From one step of Newton on (q+1)^2 = 1+r
+ */
+
+ p = r*0.5 - r*r*(0.1250079870 - r*(0.6250522999E-01));
+ twop = p + p;
+ q = p - (p*p + (twop - r))/(twop + 2.0);
+
+ /* Reconstruction */
+
+ rtc = rtc_lead + rtc_trail;
+ e >>= 1; /* e = e/2 */
+ z = rtc_lead + (rtc*q+rtc_trail);
+
+ if (denorm)
+ {
+ /* Scale by 2**(e-30) */
+ PUT_BITS_DP64(((long long)(e - 30) + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, r);
+ z *= r;
+ }
+ else
+ {
+ /* Scale by 2**e */
+ PUT_BITS_DP64(((long long)e + EXPBIAS_DP64) << EXPSHIFTBITS_DP64, r);
+ z *= r;
+ }
+
+ return z;
+
+}
+#endif /* SQRT_AMD_INLINE */
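+
+/* Worked example (editorial note, not part of the original source):
+   for x = 6.0 the reduction above gives e = 2 and y = 1.5 (the biased
+   exponent is odd, so y is placed in [1,2)), hence index = 48,
+   c = 48/32 = 1.5 and t = y - c = 0.  The approximation step then yields
+   q = 0, so the reconstruction returns 2**(e/2) * sqrt(c) = 2 * sqrt(1.5),
+   i.e. sqrt(6) ~= 2.449489742783178. */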
+
+#if defined(USE_SQRTF_AMD_INLINE)
+
+static inline float sqrtf_amd_inline(float x)
+{
+ /*
+ Computes the square root of x.
+
+ The calculation is carried out in three steps.
+
+ Step 1. Reduction.
+ The input argument is scaled to the interval [1, 4) by
+ computing
+ x = 2^e * y, where y in [1,4).
+ Furthermore y is decomposed as y = c + t where
+ c = 1 + j/32, j = 0,1,..,96; and |t| <= 1/64.
+
+ Step 2. Approximation.
+ An approximation q = sqrt(1 + (t/c)) - 1 is obtained
+ from a basic series expansion using precomputed values
+ stored in rt_jby32_lead_table_float and rt_jby32_trail_table_float.
+
+ Step 3. Reconstruction.
+ The value of sqrt(x) is reconstructed via
+ sqrt(x) = 2^(e/2) * sqrt(y)
+ = 2^(e/2) * sqrt(c) * sqrt(y/c)
+ = 2^(e/2) * sqrt(c) * sqrt(1 + t/c)
+ = 2^(e/2) * [ sqrt(c) + sqrt(c)*q ]
+ */
+
+ unsigned int ux, ax, u;
+ float r1, r2, c, y, p, q, r, twop, z, rtc, rtc_lead, rtc_trail;
+ int e, denorm = 0, index;
+
+/* Arrays rt_jby32_lead_table_float and rt_jby32_trail_table_float contain
+ leading and trailing parts respectively of precomputed
+ values of sqrt(j/32), for j = 32, 33, ..., 128.
+ rt_jby32_lead_table_float contains the first 13 bits of precision,
+ and rt_jby32_trail_table_float contains a further 24 bits precision. */
+
+static const float rt_jby32_lead_table_float[97] = {
+ 1.00000000000000000000e+00F, /* 0x3f800000 */
+ 1.01538085937500000000e+00F, /* 0x3f81f800 */
+ 1.03076171875000000000e+00F, /* 0x3f83f000 */
+ 1.04565429687500000000e+00F, /* 0x3f85d800 */
+ 1.06054687500000000000e+00F, /* 0x3f87c000 */
+ 1.07519531250000000000e+00F, /* 0x3f89a000 */
+ 1.08959960937500000000e+00F, /* 0x3f8b7800 */
+ 1.10375976562500000000e+00F, /* 0x3f8d4800 */
+ 1.11791992187500000000e+00F, /* 0x3f8f1800 */
+ 1.13183593750000000000e+00F, /* 0x3f90e000 */
+ 1.14550781250000000000e+00F, /* 0x3f92a000 */
+ 1.15917968750000000000e+00F, /* 0x3f946000 */
+ 1.17236328125000000000e+00F, /* 0x3f961000 */
+ 1.18579101562500000000e+00F, /* 0x3f97c800 */
+ 1.19873046875000000000e+00F, /* 0x3f997000 */
+ 1.21191406250000000000e+00F, /* 0x3f9b2000 */
+ 1.22460937500000000000e+00F, /* 0x3f9cc000 */
+ 1.23730468750000000000e+00F, /* 0x3f9e6000 */
+ 1.25000000000000000000e+00F, /* 0x3fa00000 */
+ 1.26220703125000000000e+00F, /* 0x3fa19000 */
+ 1.27465820312500000000e+00F, /* 0x3fa32800 */
+ 1.28686523437500000000e+00F, /* 0x3fa4b800 */
+ 1.29882812500000000000e+00F, /* 0x3fa64000 */
+ 1.31079101562500000000e+00F, /* 0x3fa7c800 */
+ 1.32275390625000000000e+00F, /* 0x3fa95000 */
+ 1.33447265625000000000e+00F, /* 0x3faad000 */
+ 1.34619140625000000000e+00F, /* 0x3fac5000 */
+ 1.35766601562500000000e+00F, /* 0x3fadc800 */
+ 1.36914062500000000000e+00F, /* 0x3faf4000 */
+ 1.38061523437500000000e+00F, /* 0x3fb0b800 */
+ 1.39184570312500000000e+00F, /* 0x3fb22800 */
+ 1.40307617187500000000e+00F, /* 0x3fb39800 */
+ 1.41406250000000000000e+00F, /* 0x3fb50000 */
+ 1.42504882812500000000e+00F, /* 0x3fb66800 */
+ 1.43603515625000000000e+00F, /* 0x3fb7d000 */
+ 1.44677734375000000000e+00F, /* 0x3fb93000 */
+ 1.45751953125000000000e+00F, /* 0x3fba9000 */
+ 1.46826171875000000000e+00F, /* 0x3fbbf000 */
+ 1.47900390625000000000e+00F, /* 0x3fbd5000 */
+ 1.48950195312500000000e+00F, /* 0x3fbea800 */
+ 1.50000000000000000000e+00F, /* 0x3fc00000 */
+ 1.51025390625000000000e+00F, /* 0x3fc15000 */
+ 1.52050781250000000000e+00F, /* 0x3fc2a000 */
+ 1.53076171875000000000e+00F, /* 0x3fc3f000 */
+ 1.54101562500000000000e+00F, /* 0x3fc54000 */
+ 1.55102539062500000000e+00F, /* 0x3fc68800 */
+ 1.56103515625000000000e+00F, /* 0x3fc7d000 */
+ 1.57104492187500000000e+00F, /* 0x3fc91800 */
+ 1.58105468750000000000e+00F, /* 0x3fca6000 */
+ 1.59082031250000000000e+00F, /* 0x3fcba000 */
+ 1.60058593750000000000e+00F, /* 0x3fcce000 */
+ 1.61035156250000000000e+00F, /* 0x3fce2000 */
+ 1.62011718750000000000e+00F, /* 0x3fcf6000 */
+ 1.62963867187500000000e+00F, /* 0x3fd09800 */
+ 1.63916015625000000000e+00F, /* 0x3fd1d000 */
+ 1.64868164062500000000e+00F, /* 0x3fd30800 */
+ 1.65820312500000000000e+00F, /* 0x3fd44000 */
+ 1.66748046875000000000e+00F, /* 0x3fd57000 */
+ 1.67700195312500000000e+00F, /* 0x3fd6a800 */
+ 1.68627929687500000000e+00F, /* 0x3fd7d800 */
+ 1.69555664062500000000e+00F, /* 0x3fd90800 */
+ 1.70458984375000000000e+00F, /* 0x3fda3000 */
+ 1.71386718750000000000e+00F, /* 0x3fdb6000 */
+ 1.72290039062500000000e+00F, /* 0x3fdc8800 */
+ 1.73193359375000000000e+00F, /* 0x3fddb000 */
+ 1.74096679687500000000e+00F, /* 0x3fded800 */
+ 1.75000000000000000000e+00F, /* 0x3fe00000 */
+ 1.75878906250000000000e+00F, /* 0x3fe12000 */
+ 1.76757812500000000000e+00F, /* 0x3fe24000 */
+ 1.77636718750000000000e+00F, /* 0x3fe36000 */
+ 1.78515625000000000000e+00F, /* 0x3fe48000 */
+ 1.79394531250000000000e+00F, /* 0x3fe5a000 */
+ 1.80273437500000000000e+00F, /* 0x3fe6c000 */
+ 1.81127929687500000000e+00F, /* 0x3fe7d800 */
+ 1.81982421875000000000e+00F, /* 0x3fe8f000 */
+ 1.82836914062500000000e+00F, /* 0x3fea0800 */
+ 1.83691406250000000000e+00F, /* 0x3feb2000 */
+ 1.84545898437500000000e+00F, /* 0x3fec3800 */
+ 1.85400390625000000000e+00F, /* 0x3fed5000 */
+ 1.86230468750000000000e+00F, /* 0x3fee6000 */
+ 1.87060546875000000000e+00F, /* 0x3fef7000 */
+ 1.87915039062500000000e+00F, /* 0x3ff08800 */
+ 1.88745117187500000000e+00F, /* 0x3ff19800 */
+ 1.89550781250000000000e+00F, /* 0x3ff2a000 */
+ 1.90380859375000000000e+00F, /* 0x3ff3b000 */
+ 1.91210937500000000000e+00F, /* 0x3ff4c000 */
+ 1.92016601562500000000e+00F, /* 0x3ff5c800 */
+ 1.92822265625000000000e+00F, /* 0x3ff6d000 */
+ 1.93627929687500000000e+00F, /* 0x3ff7d800 */
+ 1.94433593750000000000e+00F, /* 0x3ff8e000 */
+ 1.95239257812500000000e+00F, /* 0x3ff9e800 */
+ 1.96044921875000000000e+00F, /* 0x3ffaf000 */
+ 1.96826171875000000000e+00F, /* 0x3ffbf000 */
+ 1.97631835937500000000e+00F, /* 0x3ffcf800 */
+ 1.98413085937500000000e+00F, /* 0x3ffdf800 */
+ 1.99194335937500000000e+00F, /* 0x3ffef800 */
+ 2.00000000000000000000e+00F}; /* 0x40000000 */
+
+static const float rt_jby32_trail_table_float[97] = {
+ 0.00000000000000000000e+00F, /* 0x00000000 */
+ 1.23941208585165441036e-04F, /* 0x3901f637 */
+ 1.46876545841223560274e-05F, /* 0x37766aff */
+ 1.70736297150142490864e-04F, /* 0x393307ad */
+ 1.13296780909877270460e-04F, /* 0x38ed99bf */
+ 9.53458802541717886925e-05F, /* 0x38c7f46e */
+ 1.25126505736261606216e-04F, /* 0x39033464 */
+ 2.10342666832730174065e-04F, /* 0x395c8f6e */
+ 1.14066875539720058441e-04F, /* 0x38ef3730 */
+ 8.72047676239162683487e-05F, /* 0x38b6e1b4 */
+ 1.36111237225122749805e-04F, /* 0x390eb915 */
+ 2.26244374061934649944e-05F, /* 0x37bdc99c */
+ 2.40658700931817293167e-04F, /* 0x397c5954 */
+ 6.31069415248930454254e-05F, /* 0x38845848 */
+ 2.27412077947519719601e-04F, /* 0x396e7577 */
+ 5.90185391047270968556e-06F, /* 0x36c6088a */
+ 1.35496389702893793583e-04F, /* 0x390e1409 */
+ 1.32179571664892137051e-04F, /* 0x390a99af */
+ 0.00000000000000000000e+00F, /* 0x00000000 */
+ 2.31086043640971183777e-04F, /* 0x39724fb0 */
+ 9.66752704698592424393e-05F, /* 0x38cabe24 */
+ 8.85332483449019491673e-05F, /* 0x38b9aaed */
+ 2.09980673389509320259e-04F, /* 0x395c2e42 */
+ 2.20044588786549866199e-04F, /* 0x3966bbc5 */
+ 1.21749282698146998882e-04F, /* 0x38ff53a6 */
+ 1.62125259521417319775e-04F, /* 0x392a002b */
+ 9.97955357888713479042e-05F, /* 0x38d14952 */
+ 1.81545779923908412457e-04F, /* 0x393e5d53 */
+ 1.65768768056295812130e-04F, /* 0x392dd237 */
+ 5.48927710042335093021e-05F, /* 0x38663caa */
+ 9.53875860432162880898e-05F, /* 0x38c80ad2 */
+ 4.53481625299900770187e-05F, /* 0x383e3438 */
+ 1.51062369695864617825e-04F, /* 0x391e667f */
+ 1.70453247847035527229e-04F, /* 0x3932bbb2 */
+ 1.05505387182347476482e-04F, /* 0x38dd42c6 */
+ 2.02269104192964732647e-04F, /* 0x39541833 */
+ 2.18442466575652360916e-04F, /* 0x39650db4 */
+ 1.55796806211583316326e-04F, /* 0x39235d63 */
+ 1.60395247803535312414e-05F, /* 0x37868c9e */
+ 4.49578510597348213196e-05F, /* 0x383c9120 */
+ 0.00000000000000000000e+00F, /* 0x00000000 */
+ 1.26840444863773882389e-04F, /* 0x39050079 */
+ 1.82820076588541269302e-04F, /* 0x393fb364 */
+ 1.69370483490638434887e-04F, /* 0x3931990b */
+ 8.78757418831810355186e-05F, /* 0x38b849ee */
+ 1.83815121999941766262e-04F, /* 0x3940be7f */
+ 2.14343352126888930798e-04F, /* 0x3960c15b */
+ 1.80714370799250900745e-04F, /* 0x393d7e25 */
+ 8.41425862745381891727e-05F, /* 0x38b075b5 */
+ 1.69945167726837098598e-04F, /* 0x3932334f */
+ 1.95121858268976211548e-04F, /* 0x394c99a0 */
+ 1.60778334247879683971e-04F, /* 0x3928969b */
+ 6.79871009197086095810e-05F, /* 0x388e944c */
+ 1.61929419846273958683e-04F, /* 0x3929cb99 */
+ 1.99474830878898501396e-04F, /* 0x39512a1e */
+ 1.81604162207804620266e-04F, /* 0x393e6cff */
+ 1.09270178654696792364e-04F, /* 0x38e527fb */
+ 2.27539261686615645885e-04F, /* 0x396e979b */
+ 4.90300008095800876617e-05F, /* 0x384da590 */
+ 6.28985289949923753738e-05F, /* 0x3883e864 */
+ 2.58551553997676819563e-05F, /* 0x37d8e386 */
+ 1.82868374395184218884e-04F, /* 0x393fc05b */
+ 4.64625991298817098141e-05F, /* 0x3842e0d6 */
+ 1.05703387816902250051e-04F, /* 0x38ddad13 */
+ 1.17213814519345760345e-04F, /* 0x38f5d0b0 */
+ 8.17377731436863541603e-05F, /* 0x38ab6aa2 */
+ 0.00000000000000000000e+00F, /* 0x00000000 */
+ 1.16847433673683553934e-04F, /* 0x38f50bfd */
+ 1.88827965757809579372e-04F, /* 0x3946001f */
+ 2.16612941585481166840e-04F, /* 0x39632298 */
+ 2.00857131858356297016e-04F, /* 0x39529d2d */
+ 1.42199307447299361229e-04F, /* 0x39151b56 */
+ 4.12627305195201188326e-05F, /* 0x382d1185 */
+ 1.42796401632949709892e-04F, /* 0x3915bb9e */
+ 2.03253570361994206905e-04F, /* 0x39552077 */
+ 2.23214170546270906925e-04F, /* 0x396a0e99 */
+ 2.03244591830298304558e-04F, /* 0x39551e0e */
+ 1.43898156238719820976e-04F, /* 0x3916e35e */
+ 4.57155256299301981926e-05F, /* 0x383fbeac */
+ 1.53365719597786664963e-04F, /* 0x3920d0cc */
+ 2.23224633373320102692e-04F, /* 0x396a1168 */
+ 1.16566716314991936088e-05F, /* 0x37439106 */
+ 7.43694272387074306607e-06F, /* 0x36f98ada */
+ 2.11048507480882108212e-04F, /* 0x395d4ce7 */
+ 1.34682719362899661064e-04F, /* 0x390d399e */
+ 2.29425968427676707506e-05F, /* 0x37c074da */
+ 1.20421340398024767637e-04F, /* 0x38fc8ab7 */
+ 1.83421318070031702518e-04F, /* 0x394054c9 */
+ 2.12376224226318299770e-04F, /* 0x395eb14f */
+ 2.07710763788782060146e-04F, /* 0x3959ccef */
+ 1.69840845046564936638e-04F, /* 0x3932174e */
+ 9.91739216260612010956e-05F, /* 0x38cffb98 */
+ 2.40249748458154499531e-04F, /* 0x397beb8d */
+ 1.05178231024183332920e-04F, /* 0x38dc9322 */
+ 1.82623916771262884140e-04F, /* 0x393f7ebc */
+ 2.28821940254420042038e-04F, /* 0x396fefec */
+ 0.00000000000000000000e+00F}; /* 0x00000000 */
+
+
+/* Handle special arguments first */
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & (~SIGNBIT_SP32);
+
+ if(ax >= 0x7f800000)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ /* x is NaN */
+ return x + x; /* Raise invalid if it is a signalling NaN */
+    else if (ux & SIGNBIT_SP32)
+      /* x is negative infinity */
+      return nanf_with_flags(AMD_F_INVALID);
+ else
+ /* x is positive infinity */
+ return x;
+ }
+ else if (ux & SIGNBIT_SP32)
+ {
+ /* x is negative. */
+ if (x == 0.0F)
+ /* Handle negative zero first */
+ return x;
+ else
+ return nanf_with_flags(AMD_F_INVALID);
+ }
+ else if (ux <= 0x007fffff)
+ {
+ /* x is denormalised or zero */
+ if (ux == 0)
+ /* x is zero */
+ return x;
+ else
+ {
+ /* x is denormalised; scale it up */
+ /* Normalize x by increasing the exponent by 26
+ and subtracting a correction to account for the implicit
+ bit. This replaces a slow denormalized
+ multiplication by a fast normal subtraction. */
+ static const float corr = 7.888609052210118054e-31F; /* 0x0d800000 */
+ denorm = 1;
+ GET_BITS_SP32(x, ux);
+ PUT_BITS_SP32(ux | 0x0d800000, x);
+ x -= corr;
+ GET_BITS_SP32(x, ux);
+ }
+ }
+
+ /* Main algorithm */
+
+ /*
+ Find y and e such that x = 2^e * y, where y in [1,4).
+ This is done using an in-lined variant of splitFloat,
+ which also ensures that e is even.
+ */
+ y = x;
+ ux &= EXPBITS_SP32;
+ ux >>= EXPSHIFTBITS_SP32;
+ if (ux & 1)
+ {
+ GET_BITS_SP32(y, u);
+ u &= (SIGNBIT_SP32 | MANTBITS_SP32);
+ u |= ONEEXPBITS_SP32;
+ PUT_BITS_SP32(u, y);
+ e = ux - EXPBIAS_SP32;
+ }
+ else
+ {
+ GET_BITS_SP32(y, u);
+ u &= (SIGNBIT_SP32 | MANTBITS_SP32);
+ u |= TWOEXPBITS_SP32;
+ PUT_BITS_SP32(u, y);
+ e = ux - EXPBIAS_SP32 - 1;
+ }
+
+ /* Find the index of the sub-interval of [1,4) in which y lies. */
+
+ index = (int)(32.0F*y+0.5);
+
+ /* Look up the table values and compute c and r = c/t */
+
+ rtc_lead = rt_jby32_lead_table_float[index-32];
+ rtc_trail = rt_jby32_trail_table_float[index-32];
+ c = 0.03125F*index;
+ r = (y - c)/c;
+
+ /*
+ Find q = sqrt(1+r) - 1.
+ From one step of Newton on (q+1)^2 = 1+r
+ */
+
+ p = r*0.5F - r*r*(0.1250079870F - r*(0.6250522999e-01F));
+ twop = p + p;
+ q = p - (p*p + (twop - r))/(twop + 2.0);
+
+ /* Reconstruction */
+
+ rtc = rtc_lead + rtc_trail;
+ e >>= 1; /* e = e/2 */
+ z = rtc_lead + (rtc*q+rtc_trail);
+
+ if (denorm)
+ {
+ /* Scale by 2**(e-13) */
+ PUT_BITS_SP32(((e - 13) + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, r);
+ z *= r;
+ }
+ else
+ {
+ /* Scale by 2**e */
+ PUT_BITS_SP32((e + EXPBIAS_SP32) << EXPSHIFTBITS_SP32, r);
+ z *= r;
+ }
+
+ return z;
+
+}
+#endif /* SQRTF_AMD_INLINE */
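+
+/* Illustrative sketch (not part of the original AMD source): the
+   even-exponent decomposition the sqrtf kernel above relies on.  Any
+   positive normal x can be written as x = 2^e * y with e even and y in
+   [1,4), so sqrt(x) = 2^(e/2) * sqrt(y) and the exponent halves exactly.
+   The helper name split_even_exponent is invented for this sketch, which
+   is kept disabled. */
+#if 0
+#include <math.h>
+#include <stdio.h>
+
+static void split_even_exponent(double x, int *e, double *y)
+{
+  int k;
+  double m = frexp(x, &k);                  /* x = m * 2^k, 0.5 <= m < 1 */
+  if (k & 1) { *e = k - 1; *y = 2.0 * m; }  /* y in [1,2) */
+  else       { *e = k - 2; *y = 4.0 * m; }  /* y in [2,4) */
+}
+
+int main(void)
+{
+  double x = 0.875, y;
+  int e;
+  split_even_exponent(x, &e, &y);
+  printf("2^(e/2)*sqrt(y) = %.17g, sqrt(x) = %.17g\n",
+         ldexp(sqrt(y), e / 2), sqrt(x));
+  return 0;
+}
+#endif /* illustrative sketch */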
+
+#ifdef USE_LOG_KERNEL_AMD
+static inline void log_kernel_amd64(double x, unsigned long long ux, int *xexp, double *r1, double *r2)
+{
+
+ int expadjust;
+ double r, z1, z2, correction, f, f1, f2, q, u, v, poly;
+ int index;
+
+ /*
+ Computes natural log(x). Algorithm based on:
+ Ping-Tak Peter Tang
+ "Table-driven implementation of the logarithm function in IEEE
+ floating-point arithmetic"
+ ACM Transactions on Mathematical Software (TOMS)
+ Volume 16, Issue 4 (December 1990)
+ */
+
+/* Arrays ln_lead_table and ln_tail_table contain
+ leading and trailing parts respectively of precomputed
+ values of natural log(1+i/64), for i = 0, 1, ..., 64.
+ ln_lead_table contains the first 24 bits of precision,
+   and ln_tail_table contains a further 53 bits of precision. */
+
+ static const double ln_lead_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.55041813850402832031e-02, /* 0x3f8fc0a800000000 */
+ 3.07716131210327148438e-02, /* 0x3f9f829800000000 */
+ 4.58095073699951171875e-02, /* 0x3fa7745800000000 */
+ 6.06245994567871093750e-02, /* 0x3faf0a3000000000 */
+ 7.52233862876892089844e-02, /* 0x3fb341d700000000 */
+ 8.96121263504028320312e-02, /* 0x3fb6f0d200000000 */
+ 1.03796780109405517578e-01, /* 0x3fba926d00000000 */
+ 1.17783010005950927734e-01, /* 0x3fbe270700000000 */
+ 1.31576299667358398438e-01, /* 0x3fc0d77e00000000 */
+ 1.45181953907012939453e-01, /* 0x3fc2955280000000 */
+ 1.58604979515075683594e-01, /* 0x3fc44d2b00000000 */
+ 1.71850204467773437500e-01, /* 0x3fc5ff3000000000 */
+ 1.84922337532043457031e-01, /* 0x3fc7ab8900000000 */
+ 1.97825729846954345703e-01, /* 0x3fc9525a80000000 */
+ 2.10564732551574707031e-01, /* 0x3fcaf3c900000000 */
+ 2.23143517971038818359e-01, /* 0x3fcc8ff780000000 */
+ 2.35566020011901855469e-01, /* 0x3fce270700000000 */
+ 2.47836112976074218750e-01, /* 0x3fcfb91800000000 */
+ 2.59957492351531982422e-01, /* 0x3fd0a324c0000000 */
+ 2.71933674812316894531e-01, /* 0x3fd1675c80000000 */
+ 2.83768117427825927734e-01, /* 0x3fd22941c0000000 */
+ 2.95464158058166503906e-01, /* 0x3fd2e8e280000000 */
+ 3.07025015354156494141e-01, /* 0x3fd3a64c40000000 */
+ 3.18453729152679443359e-01, /* 0x3fd4618bc0000000 */
+ 3.29753279685974121094e-01, /* 0x3fd51aad80000000 */
+ 3.40926527976989746094e-01, /* 0x3fd5d1bd80000000 */
+ 3.51976394653320312500e-01, /* 0x3fd686c800000000 */
+ 3.62905442714691162109e-01, /* 0x3fd739d7c0000000 */
+ 3.73716354370117187500e-01, /* 0x3fd7eaf800000000 */
+ 3.84411692619323730469e-01, /* 0x3fd89a3380000000 */
+ 3.94993782043457031250e-01, /* 0x3fd9479400000000 */
+ 4.05465066432952880859e-01, /* 0x3fd9f323c0000000 */
+ 4.15827870368957519531e-01, /* 0x3fda9cec80000000 */
+ 4.26084339618682861328e-01, /* 0x3fdb44f740000000 */
+ 4.36236739158630371094e-01, /* 0x3fdbeb4d80000000 */
+ 4.46287095546722412109e-01, /* 0x3fdc8ff7c0000000 */
+ 4.56237375736236572266e-01, /* 0x3fdd32fe40000000 */
+ 4.66089725494384765625e-01, /* 0x3fddd46a00000000 */
+ 4.75845873355865478516e-01, /* 0x3fde744240000000 */
+ 4.85507786273956298828e-01, /* 0x3fdf128f40000000 */
+ 4.95077252388000488281e-01, /* 0x3fdfaf5880000000 */
+ 5.04556000232696533203e-01, /* 0x3fe02552a0000000 */
+ 5.13945698738098144531e-01, /* 0x3fe0723e40000000 */
+ 5.23248136043548583984e-01, /* 0x3fe0be72e0000000 */
+ 5.32464742660522460938e-01, /* 0x3fe109f380000000 */
+ 5.41597247123718261719e-01, /* 0x3fe154c3c0000000 */
+ 5.50647079944610595703e-01, /* 0x3fe19ee6a0000000 */
+ 5.59615731239318847656e-01, /* 0x3fe1e85f40000000 */
+ 5.68504691123962402344e-01, /* 0x3fe23130c0000000 */
+ 5.77315330505371093750e-01, /* 0x3fe2795e00000000 */
+ 5.86049020290374755859e-01, /* 0x3fe2c0e9e0000000 */
+ 5.94707071781158447266e-01, /* 0x3fe307d720000000 */
+ 6.03290796279907226562e-01, /* 0x3fe34e2880000000 */
+ 6.11801505088806152344e-01, /* 0x3fe393e0c0000000 */
+ 6.20240390300750732422e-01, /* 0x3fe3d90260000000 */
+ 6.28608644008636474609e-01, /* 0x3fe41d8fe0000000 */
+ 6.36907458305358886719e-01, /* 0x3fe4618bc0000000 */
+ 6.45137906074523925781e-01, /* 0x3fe4a4f840000000 */
+ 6.53301239013671875000e-01, /* 0x3fe4e7d800000000 */
+ 6.61398470401763916016e-01, /* 0x3fe52a2d20000000 */
+ 6.69430613517761230469e-01, /* 0x3fe56bf9c0000000 */
+ 6.77398800849914550781e-01, /* 0x3fe5ad4040000000 */
+ 6.85303986072540283203e-01, /* 0x3fe5ee02a0000000 */
+ 6.93147122859954833984e-01}; /* 0x3fe62e42e0000000 */
+
+ static const double ln_tail_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 5.15092497094772879206e-09, /* 0x3e361f807c79f3db */
+ 4.55457209735272790188e-08, /* 0x3e6873c1980267c8 */
+ 2.86612990859791781788e-08, /* 0x3e5ec65b9f88c69e */
+ 2.23596477332056055352e-08, /* 0x3e58022c54cc2f99 */
+ 3.49498983167142274770e-08, /* 0x3e62c37a3a125330 */
+ 3.23392843005887000414e-08, /* 0x3e615cad69737c93 */
+ 1.35722380472479366661e-08, /* 0x3e4d256ab1b285e9 */
+ 2.56504325268044191098e-08, /* 0x3e5b8abcb97a7aa2 */
+ 5.81213608741512136843e-08, /* 0x3e6f34239659a5dc */
+ 5.59374849578288093334e-08, /* 0x3e6e07fd48d30177 */
+ 5.06615629004996189970e-08, /* 0x3e6b32df4799f4f6 */
+ 5.24588857848400955725e-08, /* 0x3e6c29e4f4f21cf8 */
+ 9.61968535632653505972e-10, /* 0x3e1086c848df1b59 */
+ 1.34829655346594463137e-08, /* 0x3e4cf456b4764130 */
+ 3.65557749306383026498e-08, /* 0x3e63a02ffcb63398 */
+ 3.33431709374069198903e-08, /* 0x3e61e6a6886b0976 */
+ 5.13008650536088382197e-08, /* 0x3e6b8abcb97a7aa2 */
+ 5.09285070380306053751e-08, /* 0x3e6b578f8aa35552 */
+ 3.20853940845502057341e-08, /* 0x3e6139c871afb9fc */
+ 4.06713248643004200446e-08, /* 0x3e65d5d30701ce64 */
+ 5.57028186706125221168e-08, /* 0x3e6de7bcb2d12142 */
+ 5.48356693724804282546e-08, /* 0x3e6d708e984e1664 */
+ 1.99407553679345001938e-08, /* 0x3e556945e9c72f36 */
+ 1.96585517245087232086e-09, /* 0x3e20e2f613e85bda */
+ 6.68649386072067321503e-09, /* 0x3e3cb7e0b42724f6 */
+ 5.89936034642113390002e-08, /* 0x3e6fac04e52846c7 */
+ 2.85038578721554472484e-08, /* 0x3e5e9b14aec442be */
+ 5.09746772910284482606e-08, /* 0x3e6b5de8034e7126 */
+ 5.54234668933210171467e-08, /* 0x3e6dc157e1b259d3 */
+ 6.29100830926604004874e-09, /* 0x3e3b05096ad69c62 */
+ 2.61974119468563937716e-08, /* 0x3e5c2116faba4cdd */
+ 4.16752115011186398935e-08, /* 0x3e665fcc25f95b47 */
+ 2.47747534460820790327e-08, /* 0x3e5a9a08498d4850 */
+ 5.56922172017964209793e-08, /* 0x3e6de647b1465f77 */
+ 2.76162876992552906035e-08, /* 0x3e5da71b7bf7861d */
+ 7.08169709942321478061e-09, /* 0x3e3e6a6886b09760 */
+ 5.77453510221151779025e-08, /* 0x3e6f0075eab0ef64 */
+ 4.43021445893361960146e-09, /* 0x3e33071282fb989b */
+ 3.15140984357495864573e-08, /* 0x3e60eb43c3f1bed2 */
+ 2.95077445089736670973e-08, /* 0x3e5faf06ecb35c84 */
+ 1.44098510263167149349e-08, /* 0x3e4ef1e63db35f68 */
+ 1.05196987538551827693e-08, /* 0x3e469743fb1a71a5 */
+ 5.23641361722697546261e-08, /* 0x3e6c1cdf404e5796 */
+ 7.72099925253243069458e-09, /* 0x3e4094aa0ada625e */
+ 5.62089493829364197156e-08, /* 0x3e6e2d4c96fde3ec */
+ 3.53090261098577946927e-08, /* 0x3e62f4d5e9a98f34 */
+ 3.80080516835568242269e-08, /* 0x3e6467c96ecc5cbe */
+ 5.66961038386146408282e-08, /* 0x3e6e7040d03dec5a */
+ 4.42287063097349852717e-08, /* 0x3e67bebf4282de36 */
+ 3.45294525105681104660e-08, /* 0x3e6289b11aeb783f */
+ 2.47132034530447431509e-08, /* 0x3e5a891d1772f538 */
+ 3.59655343422487209774e-08, /* 0x3e634f10be1fb591 */
+ 5.51581770357780862071e-08, /* 0x3e6d9ce1d316eb93 */
+ 3.60171867511861372793e-08, /* 0x3e63562a19a9c442 */
+ 1.94511067964296180547e-08, /* 0x3e54e2adf548084c */
+ 1.54137376631349347838e-08, /* 0x3e508ce55cc8c97a */
+ 3.93171034490174464173e-09, /* 0x3e30e2f613e85bda */
+ 5.52990607758839766440e-08, /* 0x3e6db03ebb0227bf */
+ 3.29990737637586136511e-08, /* 0x3e61b75bb09cb098 */
+ 1.18436010922446096216e-08, /* 0x3e496f16abb9df22 */
+ 4.04248680368301346709e-08, /* 0x3e65b3f399411c62 */
+ 2.27418915900284316293e-08, /* 0x3e586b3e59f65355 */
+ 1.70263791333409206020e-08, /* 0x3e52482ceae1ac12 */
+ 5.76999904754328540596e-08}; /* 0x3e6efa39ef35793c */
+
+ /* Approximating polynomial coefficients for x near 1.0 */
+ static const double
+ ca_1 = 8.33333333333317923934e-02, /* 0x3fb55555555554e6 */
+ ca_2 = 1.25000000037717509602e-02, /* 0x3f89999999bac6d4 */
+ ca_3 = 2.23213998791944806202e-03, /* 0x3f62492307f1519f */
+ ca_4 = 4.34887777707614552256e-04; /* 0x3f3c8034c85dfff0 */
+
+ /* Approximating polynomial coefficients for other x */
+ static const double
+ cb_1 = 8.33333333333333593622e-02, /* 0x3fb5555555555557 */
+ cb_2 = 1.24999999978138668903e-02, /* 0x3f89999999865ede */
+ cb_3 = 2.23219810758559851206e-03; /* 0x3f6249423bd94741 */
+
+ static const unsigned long long
+ log_thresh1 = 0x3fee0faa00000000,
+ log_thresh2 = 0x3ff1082c00000000;
+
+ /* log_thresh1 = 9.39412117004394531250e-1 = 0x3fee0faa00000000
+ log_thresh2 = 1.06449508666992187500 = 0x3ff1082c00000000 */
+ if (ux >= log_thresh1 && ux <= log_thresh2)
+ {
+ /* Arguments close to 1.0 are handled separately to maintain
+ accuracy.
+
+ The approximation in this region exploits the identity
+         log( 1 + r ) = log( 1 + u/2 ) - log( 1 - u/2 ), where
+ u = 2r / (2+r).
+ Note that the right hand side has an odd Taylor series expansion
+ which converges much faster than the Taylor series expansion of
+ log( 1 + r ) in r. Thus, we approximate log( 1 + r ) by
+ u + A1 * u^3 + A2 * u^5 + ... + An * u^(2n+1).
+
+ One subtlety is that since u cannot be calculated from
+ r exactly, the rounding error in the first u should be
+ avoided if possible. To accomplish this, we observe that
+ u = r - r*r/(2+r).
+ Since x (=1+r) is the input argument, and thus presumed exact,
+ the formula above approximates u accurately because
+ u = r - correction,
+ and the magnitude of "correction" (of the order of r*r)
+ is small.
+ With these observations, we will approximate log( 1 + r ) by
+ r + ( (A1*u^3 + ... + An*u^(2n+1)) - correction ).
+
+ We approximate log(1+r) by an odd polynomial in u, where
+ u = 2r/(2+r) = r - r*r/(2+r).
+ */
+ r = x - 1.0;
+ u = r / (2.0 + r);
+ correction = r * u;
+ u = u + u;
+ v = u * u;
+ z1 = r;
+ z2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ *r1 = z1;
+ *r2 = z2;
+ *xexp = 0;
+ }
+ else
+ {
+ /*
+ First, we decompose the argument x to the form
+ x = 2**M * (F1 + F2),
+ where 1 <= F1+F2 < 2, M has the value of an integer,
+ F1 = 1 + j/64, j ranges from 0 to 64, and |F2| <= 1/128.
+
+ Second, we approximate log( 1 + F2/F1 ) by an odd polynomial
+      in U, where U = 2 F2 / (2 F1 + F2).
+ Note that log( 1 + F2/F1 ) = log( 1 + U/2 ) - log( 1 - U/2 ).
+ The core approximation calculates
+ Poly = [log( 1 + U/2 ) - log( 1 - U/2 )]/U - 1.
+ Note that log(1 + U/2) - log(1 - U/2) = 2 arctanh ( U/2 ),
+ thus, Poly = 2 arctanh( U/2 ) / U - 1.
+
+ It is not hard to see that
+ log(x) = M*log(2) + log(F1) + log( 1 + F2/F1 ).
+ Hence, we return Z1 = log(F1), and Z2 = log( 1 + F2/F1).
+ The values of log(F1) are calculated beforehand and stored
+ in the program.
+ */
+
+ f = x;
+ if (ux < IMPBIT_DP64)
+ {
+ /* The input argument x is denormalized */
+ /* Normalize f by increasing the exponent by 60
+ and subtracting a correction to account for the implicit
+ bit. This replaces a slow denormalized
+ multiplication by a fast normal subtraction. */
+ static const double corr = 2.5653355008114851558350183e-290; /* 0x03d0000000000000 */
+ GET_BITS_DP64(f, ux);
+ ux |= 0x03d0000000000000;
+ PUT_BITS_DP64(ux, f);
+ f -= corr;
+ GET_BITS_DP64(f, ux);
+ expadjust = 60;
+ }
+ else
+ expadjust = 0;
+
+ /* Store the exponent of x in xexp and put
+ f into the range [0.5,1) */
+ *xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64 - expadjust;
+ PUT_BITS_DP64((ux & MANTBITS_DP64) | HALFEXPBITS_DP64, f);
+
+ /* Now x = 2**xexp * f, 1/2 <= f < 1. */
+
+ /* Set index to be the nearest integer to 128*f */
+ r = 128.0 * f;
+ index = (int)(r + 0.5);
+
+ z1 = ln_lead_table[index-64];
+ q = ln_tail_table[index-64];
+ f1 = index * 0.0078125; /* 0.0078125 = 1/128 */
+ f2 = f - f1;
+ /* At this point, x = 2**xexp * ( f1 + f2 ) where
+ f1 = j/128, j = 64, 65, ..., 128 and |f2| <= 1/256. */
+
+ /* Calculate u = 2 f2 / ( 2 f1 + f2 ) = f2 / ( f1 + 0.5*f2 ) */
+ /* u = f2 / (f1 + 0.5 * f2); */
+ u = f2 / (f1 + 0.5 * f2);
+
+ /* Here, |u| <= 2(exp(1/16)-1) / (exp(1/16)+1).
+ The core approximation calculates
+ poly = [log(1 + u/2) - log(1 - u/2)]/u - 1 */
+ v = u * u;
+ poly = (v * (cb_1 + v * (cb_2 + v * cb_3)));
+ z2 = q + (u + u * poly);
+ *r1 = z1;
+ *r2 = z2;
+ }
+ return;
+}
+#endif /* USE_LOG_KERNEL_AMD */
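+
+/* Illustrative sketch (not part of the original AMD source): how a caller
+   reassembles log(x) from the kernel's outputs, following the relation used
+   by its callers in this patch, log(x) = xexp*log(2) + r1 + r2.  log2_lead
+   and log2_tail are the same split-log(2) constants used by acosh.c below;
+   the wrapper name log_via_kernel is invented here and the sketch is kept
+   disabled. */
+#if 0
+static double log_via_kernel(double x)
+{
+  static const double
+    log2_lead = 6.93147122859954833984e-01,  /* 0x3fe62e42e0000000 */
+    log2_tail = 5.76999904754328540596e-08;  /* 0x3e6efa39ef35793c */
+  unsigned long long ux;
+  double r1, r2;
+  int xexp;
+  GET_BITS_DP64(x, ux);
+  log_kernel_amd64(x, ux, &xexp, &r1, &r2);
+  /* Add the exponent contribution in two pieces so that the leading part
+     stays free of rounding error. */
+  r1 = xexp * log2_lead + r1;
+  r2 = xexp * log2_tail + r2;
+  return r1 + r2;
+}
+#endif /* illustrative sketch */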
+
+#if defined(USE_REMAINDER_PIBY2F_INLINE)
+/* Define this to get debugging print statements activated */
+#define DEBUGGING_PRINT
+#undef DEBUGGING_PRINT
+
+
+#ifdef DEBUGGING_PRINT
+#include <stdio.h>
+char *d2b(long long d, int bitsper, int point)
+{
+ static char buff[200];
+ int i, j;
+ j = bitsper;
+ if (point >= 0 && point <= bitsper)
+ j++;
+ buff[j] = '\0';
+ for (i = bitsper - 1; i >= 0; i--)
+ {
+ j--;
+ if (d % 2 == 1)
+ buff[j] = '1';
+ else
+ buff[j] = '0';
+ if (i == point)
+ {
+ j--;
+ buff[j] = '.';
+ }
+ d /= 2;
+ }
+ return buff;
+}
+#endif
+
+/* Given positive argument x, reduce it to the range [-pi/4,pi/4] using
+ extra precision, and return the result in r.
+ Return value "region" tells how many lots of pi/2 were subtracted
+ from x to put it in the range [-pi/4,pi/4], mod 4. */
+static inline void __remainder_piby2f_inline(unsigned long long ux, double *r, int *region)
+{
+
+ /* This method simulates multi-precision floating-point
+ arithmetic and is accurate for all 1 <= x < infinity */
+#define bitsper 36
+ unsigned long long res[10];
+ unsigned long long u, carry, mask, mant, nextbits;
+ int first, last, i, rexp, xexp, resexp, ltb, determ, bc;
+ double dx;
+ static const double
+ piby2 = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
+#ifdef WINDOWS
+ static unsigned long long pibits[] =
+ {
+ 0LL,
+ 5215LL, 13000023176LL, 11362338026LL, 67174558139LL,
+ 34819822259LL, 10612056195LL, 67816420731LL, 57840157550LL,
+ 19558516809LL, 50025467026LL, 25186875954LL, 18152700886LL
+ };
+#else
+ static unsigned long long pibits[] =
+ {
+ 0L,
+ 5215L, 13000023176L, 11362338026L, 67174558139L,
+ 34819822259L, 10612056195L, 67816420731L, 57840157550L,
+ 19558516809L, 50025467026L, 25186875954L, 18152700886L
+ };
+#endif
+
+#ifdef DEBUGGING_PRINT
+ printf("On entry, x = %25.20e = %s\n", x, double2hex(&x));
+#endif
+
+ xexp = (int)(((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64);
+ ux = ((ux & MANTBITS_DP64) | IMPBIT_DP64) >> 29;
+
+#ifdef DEBUGGING_PRINT
+ printf("ux = %s\n", d2b(ux, 64, -1));
+#endif
+
+ /* Now ux is the mantissa bit pattern of x as a long integer */
+ mask = 1;
+ mask = (mask << bitsper) - 1;
+
+ /* Set first and last to the positions of the first
+ and last chunks of 2/pi that we need */
+ first = xexp / bitsper;
+ resexp = xexp - first * bitsper;
+ /* 120 is the theoretical maximum number of bits (actually
+ 115 for IEEE single precision) that we need to extract
+ from the middle of 2/pi to compute the reduced argument
+ accurately enough for our purposes */
+ last = first + 120 / bitsper;
+
+#ifdef DEBUGGING_PRINT
+ printf("first = %d, last = %d\n", first, last);
+#endif
+
+ /* Do a long multiplication of the bits of 2/pi by the
+ integer mantissa */
+ /* Unroll the loop. This is only correct because we know
+ that bitsper is fixed as 36. */
+ res[4] = 0;
+ u = pibits[last] * ux;
+ res[3] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last - 1] * ux + carry;
+ res[2] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last - 2] * ux + carry;
+ res[1] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[first] * ux + carry;
+ res[0] = u & mask;
+
+#ifdef DEBUGGING_PRINT
+ printf("resexp = %d\n", resexp);
+ printf("Significant part of x * 2/pi with binary"
+ " point in correct place:\n");
+ for (i = 0; i <= last - first; i++)
+ {
+ if (i > 0 && i % 5 == 0)
+ printf("\n ");
+ if (i == 1)
+ printf("%s ", d2b(res[i], bitsper, resexp));
+ else
+ printf("%s ", d2b(res[i], bitsper, -1));
+ }
+ printf("\n");
+#endif
+
+ /* Reconstruct the result */
+ ltb = (int)((((res[0] << bitsper) | res[1])
+ >> (bitsper - 1 - resexp)) & 7);
+
+ /* determ says whether the fractional part is >= 0.5 */
+ determ = ltb & 1;
+
+#ifdef DEBUGGING_PRINT
+ printf("ltb = %d (last two bits before binary point"
+ " and first bit after)\n", ltb);
+ printf("determ = %d (1 means need to negate because the fractional\n"
+ " part of x * 2/pi is greater than 0.5)\n", determ);
+#endif
+
+ i = 1;
+ if (determ)
+ {
+ /* The mantissa is >= 0.5. We want to subtract it
+ from 1.0 by negating all the bits */
+ *region = ((ltb >> 1) + 1) & 3;
+ mant = 1;
+ mant = ~(res[1]) & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0000000000010000)
+ {
+ i++;
+ mant = (mant << bitsper) | (~(res[i]) & mask);
+ }
+ nextbits = (~(res[i+1]) & mask);
+ }
+ else
+ {
+ *region = (ltb >> 1);
+ mant = 1;
+ mant = res[1] & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0000000000010000)
+ {
+ i++;
+ mant = (mant << bitsper) | res[i];
+ }
+ nextbits = res[i+1];
+ }
+
+#ifdef DEBUGGING_PRINT
+ printf("First bits of mant = %s\n", d2b(mant, bitsper, -1));
+#endif
+
+ /* Normalize the mantissa. The shift value 6 here, determined by
+ trial and error, seems to give optimal speed. */
+ bc = 0;
+ while (mant < 0x0000400000000000LL)
+ {
+ bc += 6;
+ mant <<= 6;
+ }
+ while (mant < 0x0010000000000000LL)
+ {
+ bc++;
+ mant <<= 1;
+ }
+ mant |= nextbits >> (bitsper - bc);
+
+ rexp = 52 + resexp - bc - i * bitsper;
+
+#ifdef DEBUGGING_PRINT
+ printf("Normalised mantissa = 0x%016lx\n", mant);
+ printf("Exponent to be inserted on mantissa = rexp = %d\n", rexp);
+#endif
+
+ /* Put the result exponent rexp onto the mantissa pattern */
+ u = ((unsigned long long)rexp + EXPBIAS_DP64) << EXPSHIFTBITS_DP64;
+ ux = (mant & MANTBITS_DP64) | u;
+ if (determ)
+ /* If we negated the mantissa we negate x too */
+ ux |= SIGNBIT_DP64;
+ PUT_BITS_DP64(ux, dx);
+
+#ifdef DEBUGGING_PRINT
+ printf("(x*2/pi) = %25.20e = %s\n", dx, double2hex(&dx));
+#endif
+
+ /* x is a double precision version of the fractional part of
+ x * 2 / pi. Multiply x by pi/2 in double precision
+ to get the reduced argument r. */
+ *r = dx * piby2;
+
+#ifdef DEBUGGING_PRINT
+ printf(" r = frac(x*2/pi) * pi/2:\n");
+ printf(" r = %25.20e = %s\n", *r, double2hex(r));
+ printf("region = (number of pi/2 subtracted from x) mod 4 = %d\n",
+ *region);
+#endif
+}
+#endif /* USE_REMAINDER_PIBY2F_INLINE */
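+
+/* Illustrative sketch (not part of the original AMD source): the reduction
+   above leaves x congruent to r + region*pi/2 (mod 2*pi) with r in
+   [-pi/4,pi/4], so a sine can be recovered by dispatching on region.  The
+   check below uses the standard libm sin/cos purely for comparison; it is
+   not how the library's trigonometric routines consume the reduced
+   argument.  Kept disabled. */
+#if 0
+#include <math.h>
+
+static double sin_from_reduced(double r, int region)
+{
+  switch (region & 3)
+  {
+  case 0:  return  sin(r);   /* x ~ r          */
+  case 1:  return  cos(r);   /* x ~ r + pi/2   */
+  case 2:  return -sin(r);   /* x ~ r + pi     */
+  default: return -cos(r);   /* x ~ r + 3*pi/2 */
+  }
+}
+#endif /* illustrative sketch */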
+
+#if defined(WINDOWS)
+#if defined(USE_HANDLE_ERROR) || defined(USE_HANDLE_ERRORF)
+#include <errno.h>
+#endif
+
+#if defined(USE_HANDLE_ERROR)
+/* Define the Microsoft specific error handling routines */
+static __declspec(noinline) double handle_error(const char *name,
+ unsigned long long value,
+ int type, int flags, int error,
+ double arg1, double arg2)
+{
+ double z;
+ struct _exception exception_data;
+ exception_data.type = type;
+ exception_data.name = (char*)name;
+ exception_data.arg1 = arg1;
+ exception_data.arg2 = arg2;
+ PUT_BITS_DP64(value, z);
+ exception_data.retval = z;
+ raise_fpsw_flags(flags);
+ if (!_matherr(&exception_data))
+ {
+ errno = error;
+ }
+ return exception_data.retval;
+}
+#endif /* USE_HANDLE_ERROR */
+
+#if defined(USE_HANDLE_ERRORF)
+static __declspec(noinline) float handle_errorf(const char *name,
+ unsigned int value,
+ int type, int flags, int error,
+ float arg1, float arg2)
+{
+ float z;
+ struct _exception exception_data;
+ exception_data.type = type;
+ exception_data.name = (char*)name;
+ exception_data.arg1 = (double)arg1;
+ exception_data.arg2 = (double)arg2;
+ PUT_BITS_SP32(value, z);
+ exception_data.retval = z;
+ raise_fpsw_flags(flags);
+ if (!_matherr(&exception_data))
+ {
+ errno = error;
+ }
+ return (float)exception_data.retval;
+}
+#endif /* USE_HANDLE_ERRORF */
+#endif /* WINDOWS */
+
+#endif /* LIBM_INLINES_AMD_H_INCLUDED */
diff --git a/inc/libm_special.h b/inc/libm_special.h
new file mode 100644
index 0000000..0833b7b
--- /dev/null
+++ b/inc/libm_special.h
@@ -0,0 +1,84 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef __LIBM_SPECIAL_H__
+#define __LIBM_SPECIAL_H__
+
+// exception status set
+#define MXCSR_ES_INEXACT 0x00000020
+#define MXCSR_ES_UNDERFLOW 0x00000010
+#define MXCSR_ES_OVERFLOW 0x00000008
+#define MXCSR_ES_DIVBYZERO 0x00000004
+#define MXCSR_ES_INVALID 0x00000001
+
+void __amd_handle_errorf(int type, int error, const char *name,
+ float arg1, unsigned int arg1_is_snan,
+ float arg2, unsigned int arg2_is_snan,
+ float retval, unsigned int retval_is_snan);
+
+void __amd_handle_error(int type, int error, const char *name,
+ double arg1,
+ double arg2,
+ double retval);
+
+/* Code from GRTE/v4 math.h */
+/* Types of exceptions in the `type' field. */
+#ifndef DOMAIN
+struct exception
+ {
+ int type;
+ char *name;
+ double arg1;
+ double arg2;
+ double retval;
+ };
+
+extern int matherr (struct exception *__exc);
+
+# define X_TLOSS 1.41484755040568800000e+16
+
+/* Types of exceptions in the `type' field. */
+# define DOMAIN 1
+# define SING 2
+# define OVERFLOW 3
+# define UNDERFLOW 4
+# define TLOSS 5
+# define PLOSS 6
+
+/* SVID mode specifies returning this large value instead of infinity. */
+# define HUGE 3.40282347e+38F
+
+/* Use this define to enable a (dummy) definition of matherr(). */
+#define NEED_FAKE_MATHERR
+
+#else /* !SVID */
+
+# ifdef __USE_XOPEN
+/* X/Open wants another strange constant. */
+# define MAXFLOAT 3.40282347e+38F
+# endif
+
+#endif /* DOMAIN */
+/* Code from GRTE/v4 math.h */
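+
+/* Illustrative sketch (not part of the original header): a minimal SVID-style
+   matherr override of the kind the error paths in this library consult.
+   Returning non-zero tells the caller the error was handled, so no message
+   is printed and errno is left alone; exc->retval becomes the math
+   function's return value.  Kept disabled. */
+#if 0
+int matherr(struct exception *exc)
+{
+  if (exc->type == DOMAIN)
+  {
+    exc->retval = 0.0;  /* substitute a caller-chosen result */
+    return 1;           /* handled: suppress message and errno */
+  }
+  return 0;             /* not handled: default behaviour applies */
+}
+#endif /* illustrative sketch */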
+
+#endif // __LIBM_SPECIAL_H__
diff --git a/inc/libm_util_amd.h b/inc/libm_util_amd.h
new file mode 100644
index 0000000..f7347d0
--- /dev/null
+++ b/inc/libm_util_amd.h
@@ -0,0 +1,195 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifndef LIBM_UTIL_AMD_H_INCLUDED
+#define LIBM_UTIL_AMD_H_INCLUDED 1
+
+
+
+
+
+
+typedef float F32;
+typedef unsigned int U32;
+typedef int S32;
+
+typedef double F64;
+typedef unsigned long long U64;
+typedef long long S64;
+
+union UT32_
+{
+ F32 f32;
+ U32 u32;
+};
+
+union UT64_
+{
+ F64 f64;
+ U64 u64;
+
+ F32 f32[2];
+ U32 u32[2];
+};
+
+typedef union UT32_ UT32;
+typedef union UT64_ UT64;
+
+
+
+
+#define QNAN_MASK_32 0x00400000
+#define QNAN_MASK_64 0x0008000000000000
+
+
+#define MULTIPLIER_SP 24
+#define MULTIPLIER_DP 53
+
+static const double VAL_2PMULTIPLIER_DP = 9007199254740992.0;
+static const double VAL_2PMMULTIPLIER_DP = 1.1102230246251565404236316680908e-16;
+static const float VAL_2PMULTIPLIER_SP = 16777216.0F;
+static const float VAL_2PMMULTIPLIER_SP = 5.9604645e-8F;
+
+
+
+
+
+/* Definitions for double functions on 64 bit machines */
+#define SIGNBIT_DP64 0x8000000000000000
+#define EXPBITS_DP64 0x7ff0000000000000
+#define MANTBITS_DP64 0x000fffffffffffff
+#define ONEEXPBITS_DP64 0x3ff0000000000000
+#define TWOEXPBITS_DP64 0x4000000000000000
+#define HALFEXPBITS_DP64 0x3fe0000000000000
+#define IMPBIT_DP64 0x0010000000000000
+#define QNANBITPATT_DP64 0x7ff8000000000000
+#define INDEFBITPATT_DP64 0xfff8000000000000
+#define PINFBITPATT_DP64 0x7ff0000000000000
+#define NINFBITPATT_DP64 0xfff0000000000000
+#define EXPBIAS_DP64 1023
+#define EXPSHIFTBITS_DP64 52
+#define BIASEDEMIN_DP64 1
+#define EMIN_DP64 -1022
+#define BIASEDEMAX_DP64 2046
+#define EMAX_DP64 1023
+#define LAMBDA_DP64 1.0e300
+#define MANTLENGTH_DP64 53
+#define BASEDIGITS_DP64 15
+
+
+/* These definitions, used by float functions,
+ are for both 32 and 64 bit machines */
+#define SIGNBIT_SP32 0x80000000
+#define EXPBITS_SP32 0x7f800000
+#define MANTBITS_SP32 0x007fffff
+#define ONEEXPBITS_SP32 0x3f800000
+#define TWOEXPBITS_SP32 0x40000000
+#define HALFEXPBITS_SP32 0x3f000000
+#define IMPBIT_SP32 0x00800000
+#define QNANBITPATT_SP32 0x7fc00000
+#define INDEFBITPATT_SP32 0xffc00000
+#define PINFBITPATT_SP32 0x7f800000
+#define NINFBITPATT_SP32 0xff800000
+#define EXPBIAS_SP32 127
+#define EXPSHIFTBITS_SP32 23
+#define BIASEDEMIN_SP32 1
+#define EMIN_SP32 -126
+#define BIASEDEMAX_SP32 254
+#define EMAX_SP32 127
+#define LAMBDA_SP32 1.0e30
+#define MANTLENGTH_SP32 24
+#define BASEDIGITS_SP32 7
+
+#define CLASS_SIGNALLING_NAN 1
+#define CLASS_QUIET_NAN 2
+#define CLASS_NEGATIVE_INFINITY 3
+#define CLASS_NEGATIVE_NORMAL_NONZERO 4
+#define CLASS_NEGATIVE_DENORMAL 5
+#define CLASS_NEGATIVE_ZERO 6
+#define CLASS_POSITIVE_ZERO 7
+#define CLASS_POSITIVE_DENORMAL 8
+#define CLASS_POSITIVE_NORMAL_NONZERO 9
+#define CLASS_POSITIVE_INFINITY 10
+
+#define OLD_BITS_SP32(x) (*((unsigned int *)&x))
+#define OLD_BITS_DP64(x) (*((unsigned long long *)&x))
+
+/* Alternatives to the above functions which don't have
+ problems when using high optimization levels on gcc */
+#define GET_BITS_SP32(x, ux) \
+ { \
+ volatile union {float f; unsigned int i;} _bitsy; \
+ _bitsy.f = (x); \
+ ux = _bitsy.i; \
+ }
+#define PUT_BITS_SP32(ux, x) \
+ { \
+ volatile union {float f; unsigned int i;} _bitsy; \
+ _bitsy.i = (ux); \
+ x = _bitsy.f; \
+ }
+
+#define GET_BITS_DP64(x, ux) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.d = (x); \
+ ux = _bitsy.i; \
+ }
+#define PUT_BITS_DP64(ux, x) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.i = (ux); \
+ x = _bitsy.d; \
+ }
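+
+/* Usage sketch (not part of the original header): reading the raw bit
+   pattern of a double with GET_BITS_DP64, isolating its biased exponent
+   field, and rebuilding the value with PUT_BITS_DP64.  The function name
+   is invented here; kept disabled. */
+#if 0
+static int example_biased_exponent(double x)
+{
+  unsigned long long ux;
+  double y;
+  GET_BITS_DP64(x, ux);
+  PUT_BITS_DP64(ux, y);  /* y is bit-for-bit identical to x */
+  (void)y;
+  return (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);  /* 1023 for x = 1.0 */
+}
+#endif /* usage sketch */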
+
+
+/* Processor-dependent floating-point status flags */
+#define AMD_F_INEXACT 0x00000020
+#define AMD_F_UNDERFLOW 0x00000010
+#define AMD_F_OVERFLOW 0x00000008
+#define AMD_F_DIVBYZERO 0x00000004
+#define AMD_F_INVALID 0x00000001
+
+/* Processor-dependent floating-point precision-control flags */
+#define AMD_F_EXTENDED 0x00000300
+#define AMD_F_DOUBLE 0x00000200
+#define AMD_F_SINGLE 0x00000000
+
+/* Processor-dependent floating-point rounding-control flags */
+#define AMD_F_RC_NEAREST 0x00000000
+#define AMD_F_RC_DOWN 0x00002000
+#define AMD_F_RC_UP 0x00004000
+#define AMD_F_RC_ZERO 0x00006000
+
+/* How to get hold of an assembly square root instruction:
+ * ASMSQRT(x,y) computes y = sqrt(x).
+ */
+#ifdef WINDOWS
+/* VC++ intrinsic call */
+#define ASMSQRT(x,y) _mm_store_sd(&y, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&x)));
+#else
+/* Hammer sqrt instruction */
+#define ASMSQRT(x,y) asm volatile ("sqrtsd %1, %0" : "=x" (y) : "x" (x));
+#endif
+
+#endif /* LIBM_UTIL_AMD_H_INCLUDED */
diff --git a/libacml.h b/libacml.h
new file mode 100644
index 0000000..92c2ccb
--- /dev/null
+++ b/libacml.h
@@ -0,0 +1,76 @@
+// Copyright 2010 and onwards Google Inc.
+// Author: Martin Thuresson
+//
+// Expose fast k8 implementation of math functions with the prefix
+// "acml_". Currently acml_log(), acml_exp(), and acml_pow() have been
+// shown to have significantly better performance than glibc libm
+// and at least as good precision.
+// https://wiki.corp.google.com/twiki/bin/view/Main/CompilerMathOptimization
+//
+// When built with --cpu=piii, acml_* will call the pure libm functions,
+// avoiding the need to special-case the calls.
+//
+// TODO(martint): Update glibc to match the libacml performance.
+
+#ifndef THIRD_PARTY__OPEN64_LIBACML_MV__LIBACML_H_
+#define THIRD_PARTY__OPEN64_LIBACML_MV__LIBACML_H_
+
+#ifndef USE_LIBACML_IMPLEMENTATION
+#define USE_LIBACML_IMPLEMENTATION defined(__x86_64__)
+#endif
+
+#if USE_LIBACML_IMPLEMENTATION
+#include "third_party/open64_libacml_mv/inc/fn_macros.h"
+#else
+#include <math.h>
+#endif
+
+extern "C" {
+
+#if USE_LIBACML_IMPLEMENTATION
+// The k8 implementation of the math functions.
+#define acml_exp_k8 FN_PROTOTYPE(exp)
+#define acml_expf_k8 FN_PROTOTYPE(expf)
+#define acml_log_k8 FN_PROTOTYPE(log)
+#define acml_pow_k8 FN_PROTOTYPE(pow)
+double acml_exp_k8(double x);
+float acml_expf_k8(float x);
+double acml_log_k8(double x);
+double acml_pow_k8(double x, double y);
+#endif
+
+static inline double acml_exp(double x) {
+#if USE_LIBACML_IMPLEMENTATION
+ return acml_exp_k8(x);
+#else
+ return exp(x);
+#endif
+}
+
+static inline float acml_expf(float x) {
+#if USE_LIBACML_IMPLEMENTATION
+ return acml_expf_k8(x);
+#else
+ return expf(x);
+#endif
+}
+
+static inline double acml_log(double x) {
+#if USE_LIBACML_IMPLEMENTATION
+ return acml_log_k8(x);
+#else
+ return log(x);
+#endif
+}
+
+static inline double acml_pow(double x, double y) {
+#if USE_LIBACML_IMPLEMENTATION
+ return acml_pow_k8(x, y);
+#else
+ return pow(x, y);
+#endif
+}
+
+}
+
+#endif // THIRD_PARTY__OPEN64_LIBACML_MV__LIBACML_H_
diff --git a/libacml_portability_test.cc b/libacml_portability_test.cc
new file mode 100644
index 0000000..1f62d1a
--- /dev/null
+++ b/libacml_portability_test.cc
@@ -0,0 +1,16 @@
+#include "testing/base/public/gmock.h"
+#include "testing/base/public/gunit.h"
+#include "third_party/open64_libacml_mv/libacml.h"
+
+namespace {
+
+using ::testing::Eq;
+
+TEST(LibacmlPortabilityTest, Trivial) {
+ EXPECT_THAT(acml_exp(0), Eq(1));
+ EXPECT_THAT(acml_expf(0), Eq(1));
+ EXPECT_THAT(acml_pow(2, 2), Eq(4));
+ EXPECT_THAT(acml_log(1), Eq(0));
+}
+
+} // namespace
diff --git a/src/acos.c b/src/acos.c
new file mode 100644
index 0000000..26bac6c
--- /dev/null
+++ b/src/acos.c
@@ -0,0 +1,183 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.name = (char *)"acos";
+ exc.type = DOMAIN;
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("acos: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(acos)
+#endif
+
+double FN_PROTOTYPE(acos)(double x)
+{
+ /* Computes arccos(x).
+ The argument is first reduced by noting that arccos(x)
+ is invalid for abs(x) > 1. For denormal and small
+ arguments arccos(x) = pi/2 to machine accuracy.
+ Remaining argument ranges are handled as follows.
+ For abs(x) <= 0.5 use
+ arccos(x) = pi/2 - arcsin(x)
+ = pi/2 - (x + x^3*R(x^2))
+ where R(x^2) is a rational minimax approximation to
+ (arcsin(x) - x)/x^3.
+ For abs(x) > 0.5 exploit the identity:
+         arccos(x) = 2*arcsin(sqrt((1-x)/2))
+      (and arccos(x) = pi - arccos(-x) for negative x)
+ together with the above rational approximation, and
+ reconstruct the terms carefully.
+ */
+
+ /* Some constants and split constants. */
+
+ static const double
+ pi = 3.1415926535897933e+00, /* 0x400921fb54442d18 */
+ piby2 = 1.5707963267948965580e+00, /* 0x3ff921fb54442d18 */
+ piby2_head = 1.5707963267948965580e+00, /* 0x3ff921fb54442d18 */
+ piby2_tail = 6.12323399573676603587e-17; /* 0x3c91a62633145c07 */
+
+ double u, y, s=0.0, r;
+ int xexp, xnan, transform=0;
+
+ unsigned long long ux, aux, xneg;
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ xneg = (ux & SIGNBIT_DP64);
+ xnan = (aux > PINFBITPATT_DP64);
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+
+ /* Special cases */
+
+ if (xnan)
+ {
+#ifdef WINDOWS
+ return handle_error("acos", ux|0x0008000000000000, _DOMAIN,
+ 0, EDOM, x, 0.0);
+#else
+ return x + x; /* With invalid if it's a signalling NaN */
+#endif
+ }
+ else if (xexp < -56)
+ { /* y small enough that arccos(x) = pi/2 */
+ return val_with_flags(piby2, AMD_F_INEXACT);
+ }
+ else if (xexp >= 0)
+ { /* abs(x) >= 1.0 */
+ if (x == 1.0)
+ return 0.0;
+ else if (x == -1.0)
+ return val_with_flags(pi, AMD_F_INEXACT);
+ else
+#ifdef WINDOWS
+ return handle_error("acos", INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ if (xneg) y = -x;
+ else y = x;
+
+ transform = (xexp >= -1); /* abs(x) >= 0.5 */
+
+ if (transform)
+ { /* Transform y into the range [0,0.5) */
+ r = 0.5*(1.0 - y);
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_sd(&s, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtsd %1, %0" : "=x" (s) : "x" (r));
+#endif
+ y = s;
+ }
+ else
+ r = y*y;
+
+ /* Use a rational approximation for [0.0, 0.5] */
+
+ u = r*(0.227485835556935010735943483075 +
+ (-0.445017216867635649900123110649 +
+ (0.275558175256937652532686256258 +
+ (-0.0549989809235685841612020091328 +
+ (0.00109242697235074662306043804220 +
+ 0.0000482901920344786991880522822991*r)*r)*r)*r)*r)/
+ (1.36491501334161032038194214209 +
+ (-3.28431505720958658909889444194 +
+ (2.76568859157270989520376345954 +
+ (-0.943639137032492685763471240072 +
+ 0.105869422087204370341222318533*r)*r)*r)*r);
+
+ if (transform)
+ { /* Reconstruct acos carefully in transformed region */
+ if (xneg) return pi - 2.0*(s+(y*u - piby2_tail));
+ else
+ {
+ double c, s1;
+ unsigned long long us;
+ GET_BITS_DP64(s, us);
+ PUT_BITS_DP64(0xffffffff00000000 & us, s1);
+ c = (r-s1*s1)/(s+s1);
+ return 2.0*s1 + (2.0*c+2.0*y*u);
+ }
+ }
+ else
+ return piby2_head - (x - (piby2_tail - x*u));
+}
+
+weak_alias (__acos, acos)
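+
+/* Illustrative check (not part of the original AMD source): the transformed
+   region above rests on the half-angle identity
+   acos(x) = 2*arcsin(sqrt((1-x)/2)), with acos(x) = pi - acos(-x) used for
+   negative x.  The sketch below recomputes acos that way with the standard
+   libm asin and sqrt, purely for comparison with the careful reconstruction
+   above.  The function name is invented here; kept disabled. */
+#if 0
+#include <math.h>
+
+static double acos_via_half_angle(double x)
+{
+  static const double pi_ref = 3.14159265358979323846;
+  double ax = (x < 0.0) ? -x : x;
+  double s = sqrt(0.5 * (1.0 - ax));
+  return (x < 0.0) ? pi_ref - 2.0 * asin(s) : 2.0 * asin(s);
+}
+#endif /* illustrative check */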
diff --git a/src/acosf.c b/src/acosf.c
new file mode 100644
index 0000000..4464661
--- /dev/null
+++ b/src/acosf.c
@@ -0,0 +1,181 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_VALF_WITH_FLAGS
+#define USE_NANF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.name = (char *)"acosf";
+ exc.type = DOMAIN;
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nanf_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("acosf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(acosf)
+#endif
+
+float FN_PROTOTYPE(acosf)(float x)
+{
+ /* Computes arccos(x).
+ The argument is first reduced by noting that arccos(x)
+ is invalid for abs(x) > 1. For denormal and small
+ arguments arccos(x) = pi/2 to machine accuracy.
+ Remaining argument ranges are handled as follows.
+ For abs(x) <= 0.5 use
+ arccos(x) = pi/2 - arcsin(x)
+ = pi/2 - (x + x^3*R(x^2))
+ where R(x^2) is a rational minimax approximation to
+ (arcsin(x) - x)/x^3.
+ For abs(x) > 0.5 exploit the identity:
+         arccos(x) = 2*arcsin(sqrt((1-x)/2))
+      (and arccos(x) = pi - arccos(-x) for negative x)
+ together with the above rational approximation, and
+ reconstruct the terms carefully.
+ */
+
+ /* Some constants and split constants. */
+
+ static const float
+ piby2 = 1.5707963705e+00F; /* 0x3fc90fdb */
+ static const double
+ pi = 3.1415926535897933e+00, /* 0x400921fb54442d18 */
+ piby2_head = 1.5707963267948965580e+00, /* 0x3ff921fb54442d18 */
+ piby2_tail = 6.12323399573676603587e-17; /* 0x3c91a62633145c07 */
+
+ float u, y, s = 0.0F, r;
+ int xexp, xnan, transform = 0;
+
+ unsigned int ux, aux, xneg;
+
+ GET_BITS_SP32(x, ux);
+ aux = ux & ~SIGNBIT_SP32;
+ xneg = (ux & SIGNBIT_SP32);
+ xnan = (aux > PINFBITPATT_SP32);
+ xexp = (int)((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+
+ /* Special cases */
+
+ if (xnan)
+ {
+#ifdef WINDOWS
+ return handle_errorf("acosf", ux|0x00400000, _DOMAIN, 0,
+ EDOM, x, 0.0F);
+#else
+ return x + x; /* With invalid if it's a signalling NaN */
+#endif
+ }
+ else if (xexp < -26)
+ /* y small enough that arccos(x) = pi/2 */
+ return valf_with_flags(piby2, AMD_F_INEXACT);
+ else if (xexp >= 0)
+ { /* abs(x) >= 1.0 */
+ if (x == 1.0F)
+ return 0.0F;
+ else if (x == -1.0F)
+ return valf_with_flags((float)pi, AMD_F_INEXACT);
+ else
+#ifdef WINDOWS
+ return handle_errorf("acosf", INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ if (xneg) y = -x;
+ else y = x;
+
+ transform = (xexp >= -1); /* abs(x) >= 0.5 */
+
+ if (transform)
+ { /* Transform y into the range [0,0.5) */
+ r = 0.5F*(1.0F - y);
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_ss(&s, _mm_sqrt_ss(_mm_load_ss(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtss %1, %0" : "=x" (s) : "x" (r));
+#endif
+ y = s;
+ }
+ else
+ r = y*y;
+
+ /* Use a rational approximation for [0.0, 0.5] */
+
+ u=r*(0.184161606965100694821398249421F +
+ (-0.0565298683201845211985026327361F +
+ (-0.0133819288943925804214011424456F -
+ 0.00396137437848476485201154797087F*r)*r)*r)/
+ (1.10496961524520294485512696706F -
+ 0.836411276854206731913362287293F*r);
+
+ if (transform)
+ {
+ /* Reconstruct acos carefully in transformed region */
+ if (xneg)
+ return (float)(pi - 2.0*(s+(y*u - piby2_tail)));
+ else
+ {
+ float c, s1;
+ unsigned int us;
+ GET_BITS_SP32(s, us);
+ PUT_BITS_SP32(0xffff0000 & us, s1);
+ c = (r-s1*s1)/(s+s1);
+ return 2.0F*s1 + (2.0F*c+2.0F*y*u);
+ }
+ }
+ else
+ return (float)(piby2_head - (x - (piby2_tail - x*u)));
+}
+
+weak_alias (__acosf, acosf)
diff --git a/src/acosh.c b/src/acosh.c
new file mode 100644
index 0000000..f1d62c6
--- /dev/null
+++ b/src/acosh.c
@@ -0,0 +1,447 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#define USE_LOG_KERNEL_AMD
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+#undef USE_LOG_KERNEL_AMD
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"acosh";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("acosh: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "acosh"
+double FN_PROTOTYPE(acosh)(double x)
+{
+
+ unsigned long long ux;
+ double r, rarg, r1, r2;
+ int xexp;
+
+ static const unsigned long long
+ recrteps = 0x4196a09e667f3bcd; /* 1/sqrt(eps) = 9.49062656242515593767e+07 */
+ /* log2_lead and log2_tail sum to an extra-precise version
+ of log(2) */
+
+ static const double
+ log2_lead = 6.93147122859954833984e-01, /* 0x3fe62e42e0000000 */
+ log2_tail = 5.76999904754328540596e-08; /* 0x3e6efa39ef35793c */
+
+
+ GET_BITS_DP64(x, ux);
+
+ if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity */
+ if (ux & SIGNBIT_DP64)
+ /* x is negative infinity. Return a NaN. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ else
+ /* Return positive infinity with no signal */
+ return x;
+ }
+ }
+ else if ((ux & SIGNBIT_DP64) || (ux <= 0x3ff0000000000000))
+ {
+ /* x <= 1.0 */
+ if (ux == 0x3ff0000000000000)
+ {
+ /* x = 1.0; return zero. */
+ return 0.0;
+ }
+ else
+ {
+ /* x is less than 1.0. Return a NaN. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+ }
+
+
+ if (ux > recrteps)
+ {
+ /* Arguments greater than 1/sqrt(epsilon) in magnitude are
+ approximated by acosh(x) = ln(2) + ln(x) */
+      /* log_kernel_amd64(x) returns xexp, r1, r2 such that
+ log(x) = xexp*log(2) + r1 + r2 */
+ log_kernel_amd64(x, ux, &xexp, &r1, &r2);
+      /* Add (xexp+1) * log(2) to r1,r2 to get the result acosh(x).
+ The computed r1 is not subject to rounding error because
+ (xexp+1) has at most 10 significant bits, log(2) has 24 significant
+ bits, and r1 has up to 24 bits; and the exponents of r1
+ and r2 differ by at most 6. */
+ r1 = ((xexp+1) * log2_lead + r1);
+ r2 = ((xexp+1) * log2_tail + r2);
+ return r1 + r2;
+ }
+ else if (ux >= 0x4060000000000000)
+ {
+ /* 128.0 <= x <= 1/sqrt(epsilon) */
+ /* acosh for these arguments is approximated by
+ acosh(x) = ln(x + sqrt(x*x-1)) */
+ rarg = x*x-1.0;
+ /* Use assembly instruction to compute r = sqrt(rarg); */
+ ASMSQRT(rarg,r);
+ r += x;
+ GET_BITS_DP64(r, ux);
+ log_kernel_amd64(r, ux, &xexp, &r1, &r2);
+ r1 = (xexp * log2_lead + r1);
+ r2 = (xexp * log2_tail + r2);
+ return r1 + r2;
+ }
+ else
+ {
+ /* 1.0 < x <= 128.0 */
+ double u1, u2, v1, v2, w1, w2, hx, tx, t, r, s, p1, p2, a1, a2, c1, c2,
+ poly;
+ if (ux >= 0x3ff8000000000000)
+ {
+ /* 1.5 <= x <= 128.0 */
+ /* We use minimax polynomials,
+ based on Abramowitz and Stegun 4.6.32 series
+ expansion for acosh(x), with the log(2x) and 1/(2.2.x^2)
+ terms removed. We compensate for these two terms later.
+ */
+ t = x*x;
+ if (ux >= 0x4040000000000000)
+ {
+ /* [3,2] for 32.0 <= x <= 128.0 */
+ poly =
+ (0.45995704464157438175e-9 +
+ (-0.89080839823528631030e-9 +
+ (-0.10370522395596168095e-27 +
+ 0.35255386405811106347e-32 * t) * t) * t) /
+ (0.21941191335882074014e-8 +
+ (-0.10185073058358334569e-7 +
+ 0.95019562478430648685e-8 * t) * t);
+ }
+ else if (ux >= 0x4020000000000000)
+ {
+ /* [3,3] for 8.0 <= x <= 32.0 */
+ poly =
+ (-0.54903656589072526589e-10 +
+ (0.27646792387218569776e-9 +
+ (-0.26912957240626571979e-9 -
+ 0.86712268396736384286e-29 * t) * t) * t) /
+ (-0.24327683788655520643e-9 +
+ (0.20633757212593175571e-8 +
+ (-0.45438330985257552677e-8 +
+ 0.28707154390001678580e-8 * t) * t) * t);
+ }
+ else if (ux >= 0x4010000000000000)
+ {
+ /* [4,3] for 4.0 <= x <= 8.0 */
+ poly =
+ (-0.20827370596738166108e-6 +
+ (0.10232136919220422622e-5 +
+ (-0.98094503424623656701e-6 +
+ (-0.11615338819596146799e-18 +
+ 0.44511847799282297160e-21 * t) * t) * t) * t) /
+ (-0.92579451630913718588e-6 +
+ (0.76997374707496606639e-5 +
+ (-0.16727286999128481170e-4 +
+ 0.10463413698762590251e-4 * t) * t) * t);
+ }
+ else if (ux >= 0x4000000000000000)
+ {
+ /* [5,5] for 2.0 <= x <= 4.0 */
+ poly =
+ (-0.122195030526902362060e-7 +
+ (0.157894522814328933143e-6 +
+ (-0.579951798420930466109e-6 +
+ (0.803568881125803647331e-6 +
+ (-0.373906657221148667374e-6 -
+ 0.317856399083678204443e-21 * t) * t) * t) * t) * t) /
+ (-0.516260096352477148831e-7 +
+ (0.894662592315345689981e-6 +
+ (-0.475662774453078218581e-5 +
+ (0.107249291567405130310e-4 +
+ (-0.107871445525891289759e-4 +
+ 0.398833767702587224253e-5 * t) * t) * t) * t) * t);
+ }
+ else if (ux >= 0x3ffc000000000000)
+ {
+ /* [5,4] for 1.75 <= x <= 2.0 */
+ poly =
+ (0.1437926821253825186e-3 +
+ (-0.1034078230246627213e-2 +
+ (0.2015310005461823437e-2 +
+ (-0.1159685218876828075e-2 +
+ (-0.9267353551307245327e-11 +
+ 0.2880267770324388034e-12 * t) * t) * t) * t) * t) /
+ (0.6305521447028109891e-3 +
+ (-0.6816525887775002944e-2 +
+ (0.2228081831550003651e-1 +
+ (-0.2836886105406603318e-1 +
+ 0.1236997707206036752e-1 * t) * t) * t) * t);
+ }
+ else
+ {
+ /* [5,4] for 1.5 <= x <= 1.75 */
+ poly =
+ ( 0.7471936607751750826e-3 +
+ (-0.4849405284371905506e-2 +
+ (0.8823068059778393019e-2 +
+ (-0.4825395461288629075e-2 +
+ (-0.1001984320956564344e-8 +
+ 0.4299919281586749374e-10 * t) * t) * t) * t) * t) /
+ (0.3322359141239411478e-2 +
+ (-0.3293525930397077675e-1 +
+ (0.1011351440424239210e0 +
+ (-0.1227083591622587079e0 +
+ 0.5147099404383426080e-1 * t) * t) * t) * t);
+ }
+ GET_BITS_DP64(x, ux);
+ log_kernel_amd64(x, ux, &xexp, &r1, &r2);
+ r1 = ((xexp+1) * log2_lead + r1);
+ r2 = ((xexp+1) * log2_tail + r2);
+ /* Now (r1,r2) sum to log(2x). Subtract the term
+ 1/(2.2.x^2) = 0.25/t, and add poly/t, carefully
+ to maintain precision. (Note that we add poly/t
+ rather than poly because of the *x factor used
+ when generating the minimax polynomial) */
+ v2 = (poly-0.25)/t;
+ r = v2 + r1;
+ s = ((r1 - r) + v2) + r2;
+ v1 = r + s;
+ return v1 + ((r - v1) + s);
+ }
+
+ /* Here 1.0 <= x <= 1.5. It is hard to maintain accuracy here so
+ we have to go to great lengths to do so. */
+
+ /* We compute the value
+ t = x - 1.0 + sqrt(2.0*(x - 1.0) + (x - 1.0)*(x - 1.0))
+ using simulated quad precision. */
+ t = x - 1.0;
+ u1 = t * 2.0;
+
+ /* dekker_mul12(t,t,&v1,&v2); */
+ GET_BITS_DP64(t, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux, hx);
+ tx = t - hx;
+ v1 = t * t;
+ v2 = (((hx * hx - v1) + hx * tx) + tx * hx) + tx * tx;
+
+ /* dekker_add2(u1,0.0,v1,v2,&w1,&w2); */
+ r = u1 + v1;
+ s = (((u1 - r) + v1) + v2);
+ w1 = r + s;
+ w2 = (r - w1) + s;
+
+ /* dekker_sqrt2(w1,w2,&u1,&u2); */
+ ASMSQRT(w1,p1);
+ GET_BITS_DP64(p1, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux, c1);
+ c2 = p1 - c1;
+ a1 = p1 * p1;
+ a2 = (((c1 * c1 - a1) + c1 * c2) + c2 * c1) + c2 * c2;
+ p2 = (((w1 - a1) - a2) + w2) * 0.5 / p1;
+ u1 = p1 + p2;
+ u2 = (p1 - u1) + p2;
+
+ /* dekker_add2(u1,u2,t,0.0,&v1,&v2); */
+ r = u1 + t;
+ s = (((u1 - r) + t)) + u2;
+ r1 = r + s;
+ r2 = (r - r1) + s;
+ t = r1 + r2;
+
+ /* Check for x close to 1.0. */
+ if (x < 1.13)
+ {
+ /* Here 1.0 <= x < 1.13 implies r <= 0.656. In this region
+ we need to take extra care to maintain precision.
+ We have t = r1 + r2 = (x - 1.0 + sqrt(x*x-1.0))
+ to more than basic precision. We use the Taylor series
+ for log(1+x), with terms after the O(x*x) term
+ approximated by a [6,6] minimax polynomial. */
+ double b1, b2, c1, c2, e1, e2, q1, q2, c, cc, hr1, tr1, hpoly, tpoly, hq1, tq1, hr2, tr2;
+ poly =
+ (0.30893760556597282162e-21 +
+ (0.10513858797132174471e0 +
+ (0.27834538302122012381e0 +
+ (0.27223638654807468186e0 +
+ (0.12038958198848174570e0 +
+ (0.23357202004546870613e-1 +
+ (0.15208417992520237648e-2 +
+ 0.72741030690878441996e-7 * t) * t) * t) * t) * t) * t) * t) /
+ (0.31541576391396523486e0 +
+ (0.10715979719991342022e1 +
+ (0.14311581802952004012e1 +
+ (0.94928647994421895988e0 +
+ (0.32396235926176348977e0 +
+ (0.52566134756985833588e-1 +
+ 0.30477895574211444963e-2 * t) * t) * t) * t) * t) * t);
+
+ /* Now we can compute the result r = acosh(x) = log1p(t)
+ using the formula t - 0.5*t*t + poly*t*t. Since t is
+ represented as r1+r2, the formula becomes
+ r = r1+r2 - 0.5*(r1+r2)*(r1+r2) + poly*(r1+r2)*(r1+r2).
+ Expanding out, we get
+ r = r1 + r2 - (0.5 + poly)*(r1*r1 + 2*r1*r2 + r2*r2)
+ and ignoring negligible quantities we get
+ r = r1 + r2 - 0.5*r1*r1 + r1*r2 + poly*t*t
+ */
+ if (x < 1.06)
+ {
+ double b, c, e;
+ b = r1*r2;
+ c = 0.5*r1*r1;
+ e = poly*t*t;
+ /* N.B. the order of additions and subtractions is important */
+ r = (((r2 - b) + e) - c) + r1;
+ return r;
+ }
+ else
+ {
+ /* For 1.06 <= x <= 1.13 we must evaluate in extended precision
+ to reach about 1 ulp accuracy (in this range the simple code
+ above only manages about 1.5 ulp accuracy) */
+
+ /* Split poly, r1 and r2 into head and tail sections */
+ GET_BITS_DP64(poly, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux,hpoly);
+ tpoly = poly - hpoly;
+ GET_BITS_DP64(r1,ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux,hr1);
+ tr1 = r1 - hr1;
+ GET_BITS_DP64(r2, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux,hr2);
+ tr2 = r2 - hr2;
+
+ /* e = poly*t*t */
+ c = poly * r1;
+ cc = (((hpoly * hr1 - c) + hpoly * tr1) + tpoly * hr1) + tpoly * tr1;
+ cc = poly * r2 + cc;
+ q1 = c + cc;
+ q2 = (c - q1) + cc;
+ GET_BITS_DP64(q1, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux,hq1);
+ tq1 = q1 - hq1;
+ c = q1 * r1;
+ cc = (((hq1 * hr1 - c) + hq1 * tr1) + tq1 * hr1) + tq1 * tr1;
+ cc = q1 * r2 + q2 * r1 + cc;
+ e1 = c + cc;
+ e2 = (c - e1) + cc;
+
+ /* b = r1*r2 */
+ b1 = r1 * r2;
+ b2 = (((hr1 * hr2 - b1) + hr1 * tr2) + tr1 * hr2) + tr1 * tr2;
+
+ /* c = 0.5*r1*r1 */
+ c1 = (0.5*r1) * r1;
+ c2 = (((0.5*hr1 * hr1 - c1) + 0.5*hr1 * tr1) + 0.5*tr1 * hr1) + 0.5*tr1 * tr1;
+
+ /* v = a + d - b */
+ r = r1 - b1;
+ s = (((r1 - r) - b1) - b2) + r2;
+ v1 = r + s;
+ v2 = (r - v1) + s;
+
+ /* w = (a + d - b) - c */
+ r = v1 - c1;
+ s = (((v1 - r) - c1) - c2) + v2;
+ w1 = r + s;
+ w2 = (r - w1) + s;
+
+ /* u = ((a + d - b) - c) + e */
+ r = w1 + e1;
+ s = (((w1 - r) + e1) + e2) + w2;
+ u1 = r + s;
+ u2 = (r - u1) + s;
+
+ /* The result r = acosh(x) */
+ r = u1 + u2;
+
+ return r;
+ }
+ }
+ else
+ {
+ /* For arguments 1.13 <= x <= 1.5 the log1p function
+ is good enough */
+ return FN_PROTOTYPE(log1p)(t);
+ }
+ }
+}
+
+weak_alias (__acosh, acosh)
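+
+/* Illustrative sketch (not part of the original AMD source): the
+   "dekker_mul12" step above forms a product to double-double precision.
+   Each factor is truncated to its top 26 significant bits by masking off
+   the low 27 bits of the significand, so the partial products below are
+   exact and their combination recovers (to within a negligible term) the
+   rounding error of a*b.  With a == b == t this reproduces the v1/v2 pair
+   formed above.  The function name is invented here; kept disabled. */
+#if 0
+static void dekker_mul12_sketch(double a, double b, double *hi, double *lo)
+{
+  unsigned long long u;
+  double ha, ta, hb, tb, p;
+  GET_BITS_DP64(a, u);
+  u &= 0xfffffffff8000000;   /* high part of a: top 26 significand bits */
+  PUT_BITS_DP64(u, ha);
+  ta = a - ha;
+  GET_BITS_DP64(b, u);
+  u &= 0xfffffffff8000000;   /* high part of b: top 26 significand bits */
+  PUT_BITS_DP64(u, hb);
+  tb = b - hb;
+  p = a * b;
+  *hi = p;
+  *lo = (((ha * hb - p) + ha * tb) + ta * hb) + ta * tb;
+}
+#endif /* illustrative sketch */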
diff --git a/src/acoshf.c b/src/acoshf.c
new file mode 100644
index 0000000..c96fdb0
--- /dev/null
+++ b/src/acoshf.c
@@ -0,0 +1,149 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_NANF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"acoshf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = nanf_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("acoshf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "acoshf"
+float FN_PROTOTYPE(acoshf)(float x)
+{
+
+ unsigned int ux;
+ double dx, r, rarg, t;
+
+ static const unsigned int
+ recrteps = 0x46000000; /* 1/sqrt(eps) = 4.09600000000000000000e+03 */
+
+ static const double
+ log2 = 6.93147180559945286227e-01; /* 0x3fe62e42fefa39ef */
+
+ GET_BITS_SP32(x, ux);
+
+ if ((ux & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity */
+ if (ux & SIGNBIT_SP32)
+ /* x is negative infinity. Return a NaN. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ else
+ /* Return positive infinity with no signal */
+ return x;
+ }
+ }
+ else if ((ux & SIGNBIT_SP32) || (ux < 0x3f800000))
+ {
+ /* x is less than 1.0. Return a NaN. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ dx = x;
+
+ if (ux > recrteps)
+ {
+ /* Arguments greater than 1/sqrt(epsilon) in magnitude are
+ approximated by acoshf(x) = ln(2) + ln(x) */
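+      /* For large x, sqrt(x*x-1) ~= x, so
+         ln(x + sqrt(x*x-1)) ~= ln(2*x) = ln(2) + ln(x). */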
+ r = FN_PROTOTYPE(log)(dx) + log2;
+ }
+ else if (ux > 0x40000000)
+ {
+ /* 2.0 <= x <= 1/sqrt(epsilon) */
+ /* acoshf for these arguments is approximated by
+ acoshf(x) = ln(x + sqrt(x*x-1)) */
+ rarg = dx*dx-1.0;
+ /* Use assembly instruction to compute r = sqrt(rarg); */
+ ASMSQRT(rarg,r);
+ rarg = r + dx;
+ r = FN_PROTOTYPE(log)(rarg);
+ }
+ else
+ {
+ /* sqrt(epsilon) <= x <= 2.0 */
+ t = dx - 1.0;
+ rarg = 2.0*t + t*t;
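+      /* Since x*x - 1 = (x-1)*(x+1) = t*(2+t), we have
+         x + sqrt(x*x-1) = 1 + (t + sqrt(2.0*t + t*t)), so
+         acoshf(x) = log1p(t + sqrt(2.0*t + t*t)); forming 2.0*t + t*t
+         from t avoids the cancellation in x*x - 1 for x close to 1. */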
+ ASMSQRT(rarg,r); /* r = sqrt(rarg) */
+ rarg = t + r;
+ r = FN_PROTOTYPE(log1p)(rarg);
+ }
+ return (float)(r);
+}
+
+weak_alias (__acoshf, acoshf)
diff --git a/src/asin.c b/src/asin.c
new file mode 100644
index 0000000..0314dd8
--- /dev/null
+++ b/src/asin.c
@@ -0,0 +1,196 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"asin";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("asin: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(asin)
+#endif
+
+double FN_PROTOTYPE(asin)(double x)
+{
+ /* Computes arcsin(x).
+ The argument is first reduced by noting that arcsin(x)
+ is invalid for abs(x) > 1 and arcsin(-x) = -arcsin(x).
+ For denormal and small arguments arcsin(x) = x to machine
+ accuracy. Remaining argument ranges are handled as follows.
+ For abs(x) <= 0.5 use
+ arcsin(x) = x + x^3*R(x^2)
+ where R(x^2) is a rational minimax approximation to
+ (arcsin(x) - x)/x^3.
+ For abs(x) > 0.5 exploit the identity:
+       arcsin(x) = pi/2 - 2*arcsin(sqrt((1-x)/2))
+ together with the above rational approximation, and
+ reconstruct the terms carefully.
+ */
+
+ /* Some constants and split constants. */
+
+ static const double
+ piby2_tail = 6.1232339957367660e-17, /* 0x3c91a62633145c07 */
+ hpiby2_head = 7.8539816339744831e-01, /* 0x3fe921fb54442d18 */
+ piby2 = 1.5707963267948965e+00; /* 0x3ff921fb54442d18 */
+ double u, v, y, s=0.0, r;
+ int xexp, xnan, transform=0;
+
+ unsigned long long ux, aux, xneg;
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ xneg = (ux & SIGNBIT_DP64);
+ xnan = (aux > PINFBITPATT_DP64);
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+
+ /* Special cases */
+
+ if (xnan)
+ {
+#ifdef WINDOWS
+ return handle_error("asin", ux|0x0008000000000000, _DOMAIN,
+ 0, EDOM, x, 0.0);
+#else
+ return x + x; /* With invalid if it's a signalling NaN */
+#endif
+ }
+ else if (xexp < -28)
+ { /* y small enough that arcsin(x) = x */
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (xexp >= 0)
+ { /* abs(x) >= 1.0 */
+ if (x == 1.0)
+ return val_with_flags(piby2, AMD_F_INEXACT);
+ else if (x == -1.0)
+ return val_with_flags(-piby2, AMD_F_INEXACT);
+ else
+#ifdef WINDOWS
+ return handle_error("asin", INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ if (xneg) y = -x;
+ else y = x;
+
+ transform = (xexp >= -1); /* abs(x) >= 0.5 */
+
+ if (transform)
+ { /* Transform y into the range [0,0.5) */
+ r = 0.5*(1.0 - y);
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_sd(&s, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtsd %1, %0" : "=x" (s) : "x" (r));
+#endif
+ y = s;
+ }
+ else
+ r = y*y;
+
+ /* Use a rational approximation for [0.0, 0.5] */
+
+ u = r*(0.227485835556935010735943483075 +
+ (-0.445017216867635649900123110649 +
+ (0.275558175256937652532686256258 +
+ (-0.0549989809235685841612020091328 +
+ (0.00109242697235074662306043804220 +
+ 0.0000482901920344786991880522822991*r)*r)*r)*r)*r)/
+ (1.36491501334161032038194214209 +
+ (-3.28431505720958658909889444194 +
+ (2.76568859157270989520376345954 +
+ (-0.943639137032492685763471240072 +
+ 0.105869422087204370341222318533*r)*r)*r)*r);
+
+ if (transform)
+ { /* Reconstruct asin carefully in transformed region */
+ {
+ double c, s1, p, q;
+ unsigned long long us;
+ GET_BITS_DP64(s, us);
+ PUT_BITS_DP64(0xffffffff00000000 & us, s1);
+ c = (r-s1*s1)/(s+s1);
+ p = 2.0*s*u - (piby2_tail-2.0*c);
+ q = hpiby2_head - 2.0*s1;
+ v = hpiby2_head - (p-q);
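+      /* Here s1 is s with its low 32 bits cleared, so s1*s1 is exact and
+         c = (r-s1*s1)/(s+s1) ~= s - s1 recovers the tail of sqrt(r).
+         Since s*u ~= arcsin(s) - s, the lines above evaluate
+         v = 2*hpiby2_head + piby2_tail - 2*(s1+c) - 2*s*u
+           ~= pi/2 - 2*arcsin(s) = arcsin(y),
+         keeping the head and tail of pi/2 separate until the end. */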
+ }
+ }
+ else
+ {
+#ifdef WINDOWS
+ /* Use a temporary variable to prevent VC++ rearranging
+ y + y*u
+ into
+ y * (1 + u)
+ and getting an incorrectly rounded result */
+ double tmp;
+ tmp = y * u;
+ v = y + tmp;
+#else
+ v = y + y*u;
+#endif
+ }
+
+ if (xneg) return -v;
+ else return v;
+}
+
+weak_alias (__asin, asin)
diff --git a/src/asinf.c b/src/asinf.c
new file mode 100644
index 0000000..4b42b01
--- /dev/null
+++ b/src/asinf.c
@@ -0,0 +1,190 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_VALF_WITH_FLAGS
+#define USE_NANF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"asinf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nanf_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("asinf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(asinf)
+#endif
+
+float FN_PROTOTYPE(asinf)(float x)
+{
+ /* Computes arcsin(x).
+ The argument is first reduced by noting that arcsin(x)
+ is invalid for abs(x) > 1 and arcsin(-x) = -arcsin(x).
+ For denormal and small arguments arcsin(x) = x to machine
+ accuracy. Remaining argument ranges are handled as follows.
+ For abs(x) <= 0.5 use
+ arcsin(x) = x + x^3*R(x^2)
+ where R(x^2) is a rational minimax approximation to
+ (arcsin(x) - x)/x^3.
+ For abs(x) > 0.5 exploit the identity:
+       arcsin(x) = pi/2 - 2*arcsin(sqrt((1-x)/2))
+ together with the above rational approximation, and
+ reconstruct the terms carefully.
+ */
+
+ /* Some constants and split constants. */
+
+ static const float
+ piby2_tail = 7.5497894159e-08F, /* 0x33a22168 */
+ hpiby2_head = 7.8539812565e-01F, /* 0x3f490fda */
+ piby2 = 1.5707963705e+00F; /* 0x3fc90fdb */
+ float u, v, y, s = 0.0F, r;
+ int xexp, xnan, transform = 0;
+
+ unsigned int ux, aux, xneg;
+ GET_BITS_SP32(x, ux);
+ aux = ux & ~SIGNBIT_SP32;
+ xneg = (ux & SIGNBIT_SP32);
+ xnan = (aux > PINFBITPATT_SP32);
+ xexp = (int)((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+
+ /* Special cases */
+
+ if (xnan)
+ {
+#ifdef WINDOWS
+ return handle_errorf("asinf", ux|0x00400000, _DOMAIN, 0,
+ EDOM, x, 0.0F);
+#else
+ return x + x; /* With invalid if it's a signalling NaN */
+#endif
+ }
+ else if (xexp < -14)
+ /* y small enough that arcsin(x) = x */
+ return valf_with_flags(x, AMD_F_INEXACT);
+ else if (xexp >= 0)
+ {
+ /* abs(x) >= 1.0 */
+ if (x == 1.0F)
+ return valf_with_flags(piby2, AMD_F_INEXACT);
+ else if (x == -1.0F)
+ return valf_with_flags(-piby2, AMD_F_INEXACT);
+ else
+#ifdef WINDOWS
+ return handle_errorf("asinf", INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+
+ if (xneg) y = -x;
+ else y = x;
+
+ transform = (xexp >= -1); /* abs(x) >= 0.5 */
+
+ if (transform)
+ { /* Transform y into the range [0,0.5) */
+ r = 0.5F*(1.0F - y);
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_ss(&s, _mm_sqrt_ss(_mm_load_ss(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtss %1, %0" : "=x" (s) : "x" (r));
+#endif
+ y = s;
+ }
+ else
+ r = y*y;
+
+ /* Use a rational approximation for [0.0, 0.5] */
+
+ u=r*(0.184161606965100694821398249421F +
+ (-0.0565298683201845211985026327361F +
+ (-0.0133819288943925804214011424456F -
+ 0.00396137437848476485201154797087F*r)*r)*r)/
+ (1.10496961524520294485512696706F -
+ 0.836411276854206731913362287293F*r);
+
+ if (transform)
+ {
+ /* Reconstruct asin carefully in transformed region */
+ float c, s1, p, q;
+ unsigned int us;
+ GET_BITS_SP32(s, us);
+ PUT_BITS_SP32(0xffff0000 & us, s1);
+ c = (r-s1*s1)/(s+s1);
+ p = 2.0F*s*u - (piby2_tail-2.0F*c);
+ q = hpiby2_head - 2.0F*s1;
+ v = hpiby2_head - (p-q);
+ }
+ else
+ {
+#ifdef WINDOWS
+ /* Use a temporary variable to prevent VC++ rearranging
+ y + y*u
+ into
+ y * (1 + u)
+ and getting an incorrectly rounded result */
+ float tmp;
+ tmp = y * u;
+ v = y + tmp;
+#else
+ v = y + y*u;
+#endif
+ }
+
+ if (xneg) return -v;
+ else return v;
+}
+
+weak_alias (__asinf, asinf)
diff --git a/src/asinh.c b/src/asinh.c
new file mode 100644
index 0000000..7ecde9c
--- /dev/null
+++ b/src/asinh.c
@@ -0,0 +1,322 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_HANDLE_ERROR
+#define USE_LOG_KERNEL_AMD
+#define USE_VAL_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#undef USE_LOG_KERNEL_AMD
+#undef USE_VAL_WITH_FLAGS
+
+#undef _FUNCNAME
+#define _FUNCNAME "asinh"
+double FN_PROTOTYPE(asinh)(double x)
+{
+
+ unsigned long long ux, ax, xneg;
+ double absx, r, rarg, t, r1, r2, poly, s, v1, v2;
+ int xexp;
+
+ static const unsigned long long
+ rteps = 0x3e46a09e667f3bcd, /* sqrt(eps) = 1.05367121277235086670e-08 */
+ recrteps = 0x4196a09e667f3bcd; /* 1/rteps = 9.49062656242515593767e+07 */
+
+ /* log2_lead and log2_tail sum to an extra-precise version
+ of log(2) */
+ static const double
+ log2_lead = 6.93147122859954833984e-01, /* 0x3fe62e42e0000000 */
+ log2_tail = 5.76999904754328540596e-08; /* 0x3e6efa39ef35793c */
+
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & ~SIGNBIT_DP64;
+ xneg = ux & SIGNBIT_DP64;
+ PUT_BITS_DP64(ax, absx);
+
+ if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity. Return the same infinity. */
+#ifdef WINDOWS
+ if (ux & SIGNBIT_DP64)
+ return handle_error(_FUNCNAME, NINFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+ else
+ return handle_error(_FUNCNAME, PINFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x;
+#endif
+ }
+ }
+ else if (ax < rteps) /* abs(x) < sqrt(epsilon) */
+ {
+ if (ax == 0x0000000000000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ {
+ /* Tiny arguments approximated by asinh(x) = x
+ - avoid slow operations on denormalized numbers */
+ return val_with_flags(x,AMD_F_INEXACT);
+ }
+ }
+
+
+ if (ax <= 0x3ff0000000000000) /* abs(x) <= 1.0 */
+ {
+ /* Arguments less than 1.0 in magnitude are
+ approximated by [4,4] or [5,4] minimax polynomials
+ fitted to asinh series 4.6.31 (x < 1) from Abramowitz and Stegun
+ */
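+      /* The series referred to above is, for |x| < 1,
+           asinh(x) = x - x^3/(2*3) + (1*3)*x^5/(2*4*5) - (1*3*5)*x^7/(2*4*6*7) + ...
+         so the rational fits below give poly ~= (asinh(x) - x)/x^3 and
+         the result is reconstructed as x + x*t*poly with t = x*x. */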
+ t = x*x;
+ if (ax < 0x3fd0000000000000)
+ {
+ /* [4,4] for 0 < abs(x) < 0.25 */
+ poly =
+ (-0.12845379283524906084997e0 +
+ (-0.21060688498409799700819e0 +
+ (-0.10188951822578188309186e0 +
+ (-0.13891765817243625541799e-1 -
+ 0.10324604871728082428024e-3 * t) * t) * t) * t) /
+ (0.77072275701149440164511e0 +
+ (0.16104665505597338100747e1 +
+ (0.11296034614816689554875e1 +
+ (0.30079351943799465092429e0 +
+ 0.235224464765951442265117e-1 * t) * t) * t) * t);
+ }
+ else if (ax < 0x3fe0000000000000)
+ {
+ /* [4,4] for 0.25 <= abs(x) < 0.5 */
+ poly =
+ (-0.12186605129448852495563e0 +
+ (-0.19777978436593069928318e0 +
+ (-0.94379072395062374824320e-1 +
+ (-0.12620141363821680162036e-1 -
+ 0.903396794842691998748349e-4 * t) * t) * t) * t) /
+ (0.73119630776696495279434e0 +
+ (0.15157170446881616648338e1 +
+ (0.10524909506981282725413e1 +
+ (0.27663713103600182193817e0 +
+ 0.21263492900663656707646e-1 * t) * t) * t) * t);
+ }
+ else if (ax < 0x3fe8000000000000)
+ {
+ /* [4,4] for 0.5 <= abs(x) < 0.75 */
+ poly =
+ (-0.81210026327726247622500e-1 +
+ (-0.12327355080668808750232e0 +
+ (-0.53704925162784720405664e-1 +
+ (-0.63106739048128554465450e-2 -
+ 0.35326896180771371053534e-4 * t) * t) * t) * t) /
+ (0.48726015805581794231182e0 +
+ (0.95890837357081041150936e0 +
+ (0.62322223426940387752480e0 +
+ (0.15028684818508081155141e0 +
+ 0.10302171620320141529445e-1 * t) * t) * t) * t);
+ }
+ else
+ {
+ /* [5,4] for 0.75 <= abs(x) <= 1.0 */
+ poly =
+ (-0.4638179204422665073e-1 +
+ (-0.7162729496035415183e-1 +
+ (-0.3247795155696775148e-1 +
+ (-0.4225785421291932164e-2 +
+ (-0.3808984717603160127e-4 +
+ 0.8023464184964125826e-6 * t) * t) * t) * t) * t) /
+ (0.2782907534642231184e0 +
+ (0.5549945896829343308e0 +
+ (0.3700732511330698879e0 +
+ (0.9395783438240780722e-1 +
+ 0.7200057974217143034e-2 * t) * t) * t) * t);
+ }
+ return x + x*t*poly;
+ }
+ else if (ax < 0x4040000000000000)
+ {
+ /* 1.0 <= abs(x) <= 32.0 */
+ /* Arguments in this region are approximated by various
+ minimax polynomials fitted to asinh series 4.6.31
+ in Abramowitz and Stegun.
+ */
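+      /* For x > 1 the series referred to above is
+           asinh(x) = ln(2*x) + 1/(2*2*x^2) - (1*3)/(2*4*4*x^4) + ...
+         The fits below approximate the tail beyond the 1/(4*x^2) term,
+         scaled by t = x*x, which is why the reconstruction further down
+         adds (poly + 0.25)/t to an extra-precise ln(2*x). */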
+ t = x*x;
+ if (ax >= 0x4020000000000000)
+ {
+ /* [3,3] for 8.0 <= abs(x) <= 32.0 */
+ poly =
+ (-0.538003743384069117e-10 +
+ (-0.273698654196756169e-9 +
+ (-0.268129826956403568e-9 -
+ 0.804163374628432850e-29 * t) * t) * t) /
+ (0.238083376363471960e-9 +
+ (0.203579344621125934e-8 +
+ (0.450836980450693209e-8 +
+ 0.286005148753497156e-8 * t) * t) * t);
+ }
+ else if (ax >= 0x4010000000000000)
+ {
+ /* [4,3] for 4.0 <= abs(x) <= 8.0 */
+ poly =
+ (-0.178284193496441400e-6 +
+ (-0.928734186616614974e-6 +
+ (-0.923318925566302615e-6 +
+ (-0.776417026702577552e-19 +
+ 0.290845644810826014e-21 * t) * t) * t) * t) /
+ (0.786694697277890964e-6 +
+ (0.685435665630965488e-5 +
+ (0.153780175436788329e-4 +
+ 0.984873520613417917e-5 * t) * t) * t);
+
+ }
+ else if (ax >= 0x4000000000000000)
+ {
+ /* [5,4] for 2.0 <= abs(x) <= 4.0 */
+ poly =
+ (-0.209689451648100728e-6 +
+ (-0.219252358028695992e-5 +
+ (-0.551641756327550939e-5 +
+ (-0.382300259826830258e-5 +
+ (-0.421182121910667329e-17 +
+ 0.492236019998237684e-19 * t) * t) * t) * t) * t) /
+ (0.889178444424237735e-6 +
+ (0.131152171690011152e-4 +
+ (0.537955850185616847e-4 +
+ (0.814966175170941864e-4 +
+ 0.407786943832260752e-4 * t) * t) * t) * t);
+ }
+ else if (ax >= 0x3ff8000000000000)
+ {
+ /* [5,4] for 1.5 <= abs(x) <= 2.0 */
+ poly =
+ (-0.195436610112717345e-4 +
+ (-0.233315515113382977e-3 +
+ (-0.645380957611087587e-3 +
+ (-0.478948863920281252e-3 +
+ (-0.805234112224091742e-12 +
+ 0.246428598194879283e-13 * t) * t) * t) * t) * t) /
+ (0.822166621698664729e-4 +
+ (0.135346265620413852e-2 +
+ (0.602739242861830658e-2 +
+ (0.972227795510722956e-2 +
+ 0.510878800983771167e-2 * t) * t) * t) * t);
+ }
+ else
+ {
+ /* [5,5] for 1.0 <= abs(x) <= 1.5 */
+ poly =
+ (-0.121224194072430701e-4 +
+ (-0.273145455834305218e-3 +
+ (-0.152866982560895737e-2 +
+ (-0.292231744584913045e-2 +
+ (-0.174670900236060220e-2 -
+ 0.891754209521081538e-12 * t) * t) * t) * t) * t) /
+ (0.499426632161317606e-4 +
+ (0.139591210395547054e-2 +
+ (0.107665231109108629e-1 +
+ (0.325809818749873406e-1 +
+ (0.415222526655158363e-1 +
+ 0.186315628774716763e-1 * t) * t) * t) * t) * t);
+ }
+ log_kernel_amd64(absx, ax, &xexp, &r1, &r2);
+ r1 = ((xexp+1) * log2_lead + r1);
+ r2 = ((xexp+1) * log2_tail + r2);
+ /* Now (r1,r2) sum to log(2x). Add the term
+         1/(2*2*x^2) = 0.25/t, and add poly/t, carefully
+ to maintain precision. (Note that we add poly/t
+ rather than poly because of the *x factor used
+ when generating the minimax polynomial) */
+ v2 = (poly+0.25)/t;
+ r = v2 + r1;
+ s = ((r1 - r) + v2) + r2;
+ v1 = r + s;
+ v2 = (r - v1) + s;
+ r = v1 + v2;
+ if (xneg)
+ return -r;
+ else
+ return r;
+ }
+ else
+ {
+ /* abs(x) > 32.0 */
+ if (ax > recrteps)
+ {
+ /* Arguments greater than 1/sqrt(epsilon) in magnitude are
+ approximated by asinh(x) = ln(2) + ln(abs(x)), with sign of x */
+ /* log_kernel_amd(x) returns xexp, r1, r2 such that
+ log(x) = xexp*log(2) + r1 + r2 */
+ log_kernel_amd64(absx, ax, &xexp, &r1, &r2);
+          /* Add (xexp+1) * log(2) to r1,r2 to get the result asinh(x).
+ The computed r1 is not subject to rounding error because
+ (xexp+1) has at most 10 significant bits, log(2) has 24 significant
+ bits, and r1 has up to 24 bits; and the exponents of r1
+ and r2 differ by at most 6. */
+ r1 = ((xexp+1) * log2_lead + r1);
+ r2 = ((xexp+1) * log2_tail + r2);
+ if (xneg)
+ return -(r1 + r2);
+ else
+ return r1 + r2;
+ }
+ else
+ {
+ rarg = absx*absx+1.0;
+ /* Arguments such that 32.0 <= abs(x) <= 1/sqrt(epsilon) are
+ approximated by
+ asinh(x) = ln(abs(x) + sqrt(x*x+1))
+ with the sign of x (see Abramowitz and Stegun 4.6.20) */
+ /* Use assembly instruction to compute r = sqrt(rarg); */
+ ASMSQRT(rarg,r);
+ r += absx;
+ GET_BITS_DP64(r, ax);
+ log_kernel_amd64(r, ax, &xexp, &r1, &r2);
+ r1 = (xexp * log2_lead + r1);
+ r2 = (xexp * log2_tail + r2);
+ if (xneg)
+ return -(r1 + r2);
+ else
+ return r1 + r2;
+ }
+ }
+}
+
+weak_alias (__asinh, asinh)
diff --git a/src/asinhf.c b/src/asinhf.c
new file mode 100644
index 0000000..f5d3bf9
--- /dev/null
+++ b/src/asinhf.c
@@ -0,0 +1,164 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_HANDLE_ERRORF
+#define USE_VALF_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERRORF
+#undef USE_VALF_WITH_FLAGS
+
+#undef _FUNCNAME
+#define _FUNCNAME "asinhf"
+float FN_PROTOTYPE(asinhf)(float x)
+{
+
+ double dx;
+ unsigned int ux, ax, xneg;
+ double absx, r, rarg, t, poly;
+
+ static const unsigned int
+ rteps = 0x39800000, /* sqrt(eps) = 2.44140625000000000000e-04 */
+    recrteps = 0x46000000; /* 2/rteps = 8.19200000000000000000e+03 */
+
+ static const double
+ log2 = 6.93147180559945286227e-01; /* 0x3fe62e42fefa39ef */
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & ~SIGNBIT_SP32;
+ xneg = ux & SIGNBIT_SP32;
+
+ if ((ux & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity. Return the same infinity. */
+#ifdef WINDOWS
+ if (ux & SIGNBIT_SP32)
+ return handle_errorf(_FUNCNAME, NINFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+ else
+ return handle_errorf(_FUNCNAME, PINFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return x;
+#endif
+ }
+ }
+ else if (ax < rteps) /* abs(x) < sqrt(epsilon) */
+ {
+ if (ax == 0x00000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ {
+ /* Tiny arguments approximated by asinhf(x) = x
+ - avoid slow operations on denormalized numbers */
+ return valf_with_flags(x,AMD_F_INEXACT);
+ }
+ }
+
+ dx = x;
+ if (xneg)
+ absx = -dx;
+ else
+ absx = dx;
+
+ if (ax <= 0x40800000) /* abs(x) <= 4.0 */
+ {
+ /* Arguments less than 4.0 in magnitude are
+ approximated by [4,4] minimax polynomials
+ */
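+      /* As in asinh, the fits below give poly ~= (asinh(x) - x)/x^3,
+         evaluated in double precision, and the result is reconstructed
+         as x + x*t*poly with t = x*x before rounding back to float. */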
+ t = dx*dx;
+ if (ax <= 0x40000000) /* abs(x) <= 2 */
+ poly =
+ (-0.1152965835871758072e-1 +
+ (-0.1480204186473758321e-1 +
+ (-0.5063201055468483248e-2 +
+ (-0.4162727710583425360e-3 -
+ 0.1177198915954942694e-5 * t) * t) * t) * t) /
+ (0.6917795026025976739e-1 +
+ (0.1199423176003939087e+0 +
+ (0.6582362487198468066e-1 +
+ (0.1260024978680227945e-1 +
+ 0.6284381367285534560e-3 * t) * t) * t) * t);
+ else
+ poly =
+ (-0.185462290695578589e-2 +
+ (-0.113672533502734019e-2 +
+ (-0.142208387300570402e-3 +
+ (-0.339546014993079977e-5 -
+ 0.151054665394480990e-8 * t) * t) * t) * t) /
+ (0.111486158580024771e-1 +
+ (0.117782437980439561e-1 +
+ (0.325903773532674833e-2 +
+ (0.255902049924065424e-3 +
+ 0.434150786948890837e-5 * t) * t) * t) * t);
+ return (float)(dx + dx*t*poly);
+ }
+ else
+ {
+ /* abs(x) > 4.0 */
+ if (ax > recrteps)
+ {
+ /* Arguments greater than 1/sqrt(epsilon) in magnitude are
+ approximated by asinhf(x) = ln(2) + ln(abs(x)), with sign of x */
+ r = FN_PROTOTYPE(log)(absx) + log2;
+ }
+ else
+ {
+ rarg = absx*absx+1.0;
+ /* Arguments such that 4.0 <= abs(x) <= 1/sqrt(epsilon) are
+ approximated by
+ asinhf(x) = ln(abs(x) + sqrt(x*x+1))
+ with the sign of x (see Abramowitz and Stegun 4.6.20) */
+ /* Use assembly instruction to compute r = sqrt(rarg); */
+ ASMSQRT(rarg,r);
+ r += absx;
+ r = FN_PROTOTYPE(log)(r);
+ }
+ if (xneg)
+ return (float)(-r);
+ else
+ return (float)r;
+ }
+}
+
+weak_alias (__asinhf, asinhf)
diff --git a/src/atan.c b/src/atan.c
new file mode 100644
index 0000000..3b99df9
--- /dev/null
+++ b/src/atan.c
@@ -0,0 +1,171 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VAL_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.name = (char *)"atan";
+ exc.type = DOMAIN;
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("atan: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atan)
+#endif
+
+double FN_PROTOTYPE(atan)(double x)
+{
+
+ /* Some constants and split constants. */
+
+ static double piby2 = 1.5707963267948966e+00; /* 0x3ff921fb54442d18 */
+ double chi, clo, v, s, q, z;
+
+ /* Find properties of argument x. */
+
+ unsigned long long ux, aux, xneg;
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ xneg = (ux != aux);
+
+ if (xneg) v = -x;
+ else v = x;
+
+ /* Argument reduction to range [-7/16,7/16] */
+
+ if (aux < 0x3e50000000000000) /* v < 2.0^(-26) */
+ {
+ /* x is a good approximation to atan(x) and avoids working on
+ intermediate denormal numbers */
+ if (aux == 0x0000000000000000)
+ return x;
+ else
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (aux > 0x4003800000000000) /* v > 39./16. */
+ {
+
+ if (aux > PINFBITPATT_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error("atan", ux|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it's a signalling NaN */
+#endif
+ }
+ else if (aux > 0x4370000000000000)
+ { /* abs(x) > 2^56 => arctan(1/x) is
+ insignificant compared to piby2 */
+ if (xneg)
+ return val_with_flags(-piby2, AMD_F_INEXACT);
+ else
+ return val_with_flags(piby2, AMD_F_INEXACT);
+ }
+
+ x = -1.0/v;
+ /* (chi + clo) = arctan(infinity) */
+ chi = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
+ clo = 6.12323399573676480327e-17; /* 0x3c91a62633145c06 */
+ }
+ else if (aux > 0x3ff3000000000000) /* 39./16. > v > 19./16. */
+ {
+ x = (v-1.5)/(1.0+1.5*v);
+ /* (chi + clo) = arctan(1.5) */
+ chi = 9.82793723247329054082e-01; /* 0x3fef730bd281f69b */
+ clo = 1.39033110312309953701e-17; /* 0x3c7007887af0cbbc */
+ }
+ else if (aux > 0x3fe6000000000000) /* 19./16. > v > 11./16. */
+ {
+ x = (v-1.0)/(1.0+v);
+ /* (chi + clo) = arctan(1.) */
+ chi = 7.85398163397448278999e-01; /* 0x3fe921fb54442d18 */
+ clo = 3.06161699786838240164e-17; /* 0x3c81a62633145c06 */
+ }
+ else if (aux > 0x3fdc000000000000) /* 11./16. > v > 7./16. */
+ {
+ x = (2.0*v-1.0)/(2.0+v);
+ /* (chi + clo) = arctan(0.5) */
+ chi = 4.63647609000806093515e-01; /* 0x3fddac670561bb4f */
+ clo = 2.26987774529616809294e-17; /* 0x3c7a2b7f222f65e0 */
+ }
+ else /* v < 7./16. */
+ {
+ x = v;
+ chi = 0.0;
+ clo = 0.0;
+ }
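+
+  /* In each branch above x = (v - c)/(1 + c*v) for a pivot c in
+     {0, 0.5, 1.0, 1.5} (or x = -1/v with c = infinity), so that
+     atan(v) = atan(c) + atan(x) with abs(x) <= 7/16, and (chi,clo)
+     hold atan(c) split into a head and a tail part. */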
+
+ /* Core approximation: Remez(4,4) on [-7/16,7/16] */
+
+ s = x*x;
+ q = x*s*
+ (0.268297920532545909e0 +
+ (0.447677206805497472e0 +
+ (0.220638780716667420e0 +
+ (0.304455919504853031e-1 +
+ 0.142316903342317766e-3*s)*s)*s)*s)/
+ (0.804893761597637733e0 +
+ (0.182596787737507063e1 +
+ (0.141254259931958921e1 +
+ (0.424602594203847109e0 +
+ 0.389525873944742195e-1*s)*s)*s)*s);
+
+ z = chi - ((q - clo) - x);
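+  /* q approximates x - atan(x), so z = chi + clo + (x - q)
+     ~= atan(c) + atan(x) = atan(v). */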
+
+ if (xneg) z = -z;
+ return z;
+}
+
+weak_alias (__atan, atan)
diff --git a/src/atan2.c b/src/atan2.c
new file mode 100644
index 0000000..6531ee4
--- /dev/null
+++ b/src/atan2.c
@@ -0,0 +1,785 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_VAL_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_SCALEUPDOUBLE1024
+#define USE_SCALEDOWNDOUBLE
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VAL_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_SCALEUPDOUBLE1024
+#undef USE_SCALEDOWNDOUBLE
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range arguments
+ (only used when _LIB_VERSION is _SVID_) */
+static inline double retval_errno_edom(double x, double y)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = y;
+ exc.name = (char *)"atan2";
+ exc.type = DOMAIN;
+ exc.retval = HUGE;
+ if (!matherr(&exc))
+ {
+ (void)fputs("atan2: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atan2)
+#endif
+
+double FN_PROTOTYPE(atan2)(double y, double x)
+{
+ /* Arrays atan_jby256_lead and atan_jby256_tail contain
+ leading and trailing parts respectively of precomputed
+ values of atan(j/256), for j = 16, 17, ..., 256.
+ atan_jby256_lead contains the first 21 bits of precision,
+     and atan_jby256_tail contains a further 53 bits of precision. */
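+
+  /* Together the lead and tail parts give each atan(j/256) to about
+     74 bits, so the table entries contribute essentially no rounding
+     error of their own; the tail part can be folded into the small
+     correction term before the lead part is added. */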
+
+ static const double atan_jby256_lead[ 241] = {
+ 6.24187886714935302734e-02, /* 0x3faff55b00000000 */
+ 6.63088560104370117188e-02, /* 0x3fb0f99e00000000 */
+ 7.01969265937805175781e-02, /* 0x3fb1f86d00000000 */
+ 7.40829110145568847656e-02, /* 0x3fb2f71900000000 */
+ 7.79666304588317871094e-02, /* 0x3fb3f59f00000000 */
+ 8.18479657173156738281e-02, /* 0x3fb4f3fd00000000 */
+ 8.57268571853637695312e-02, /* 0x3fb5f23200000000 */
+ 8.96031260490417480469e-02, /* 0x3fb6f03b00000000 */
+ 9.34767723083496093750e-02, /* 0x3fb7ee1800000000 */
+ 9.73475575447082519531e-02, /* 0x3fb8ebc500000000 */
+ 1.01215422153472900391e-01, /* 0x3fb9e94100000000 */
+ 1.05080246925354003906e-01, /* 0x3fbae68a00000000 */
+ 1.08941912651062011719e-01, /* 0x3fbbe39e00000000 */
+ 1.12800359725952148438e-01, /* 0x3fbce07c00000000 */
+ 1.16655409336090087891e-01, /* 0x3fbddd2100000000 */
+ 1.20507001876831054688e-01, /* 0x3fbed98c00000000 */
+ 1.24354958534240722656e-01, /* 0x3fbfd5ba00000000 */
+ 1.28199219703674316406e-01, /* 0x3fc068d500000000 */
+ 1.32039666175842285156e-01, /* 0x3fc0e6ad00000000 */
+ 1.35876297950744628906e-01, /* 0x3fc1646500000000 */
+ 1.39708757400512695312e-01, /* 0x3fc1e1fa00000000 */
+ 1.43537282943725585938e-01, /* 0x3fc25f6e00000000 */
+ 1.47361397743225097656e-01, /* 0x3fc2dcbd00000000 */
+ 1.51181221008300781250e-01, /* 0x3fc359e800000000 */
+ 1.54996633529663085938e-01, /* 0x3fc3d6ee00000000 */
+ 1.58807516098022460938e-01, /* 0x3fc453ce00000000 */
+ 1.62613749504089355469e-01, /* 0x3fc4d08700000000 */
+ 1.66415214538574218750e-01, /* 0x3fc54d1800000000 */
+ 1.70211911201477050781e-01, /* 0x3fc5c98100000000 */
+ 1.74003481864929199219e-01, /* 0x3fc645bf00000000 */
+ 1.77790164947509765625e-01, /* 0x3fc6c1d400000000 */
+ 1.81571602821350097656e-01, /* 0x3fc73dbd00000000 */
+ 1.85347914695739746094e-01, /* 0x3fc7b97b00000000 */
+ 1.89118742942810058594e-01, /* 0x3fc8350b00000000 */
+ 1.92884206771850585938e-01, /* 0x3fc8b06e00000000 */
+ 1.96644186973571777344e-01, /* 0x3fc92ba300000000 */
+ 2.00398445129394531250e-01, /* 0x3fc9a6a800000000 */
+ 2.04147100448608398438e-01, /* 0x3fca217e00000000 */
+ 2.07889914512634277344e-01, /* 0x3fca9c2300000000 */
+ 2.11626768112182617188e-01, /* 0x3fcb169600000000 */
+ 2.15357661247253417969e-01, /* 0x3fcb90d700000000 */
+ 2.19082474708557128906e-01, /* 0x3fcc0ae500000000 */
+ 2.22801089286804199219e-01, /* 0x3fcc84bf00000000 */
+ 2.26513504981994628906e-01, /* 0x3fccfe6500000000 */
+ 2.30219483375549316406e-01, /* 0x3fcd77d500000000 */
+ 2.33919143676757812500e-01, /* 0x3fcdf11000000000 */
+ 2.37612247467041015625e-01, /* 0x3fce6a1400000000 */
+ 2.41298794746398925781e-01, /* 0x3fcee2e100000000 */
+ 2.44978547096252441406e-01, /* 0x3fcf5b7500000000 */
+ 2.48651623725891113281e-01, /* 0x3fcfd3d100000000 */
+ 2.52317905426025390625e-01, /* 0x3fd025fa00000000 */
+ 2.55977153778076171875e-01, /* 0x3fd061ee00000000 */
+ 2.59629487991333007812e-01, /* 0x3fd09dc500000000 */
+ 2.63274669647216796875e-01, /* 0x3fd0d97e00000000 */
+ 2.66912937164306640625e-01, /* 0x3fd1151a00000000 */
+ 2.70543813705444335938e-01, /* 0x3fd1509700000000 */
+ 2.74167299270629882812e-01, /* 0x3fd18bf500000000 */
+ 2.77783632278442382812e-01, /* 0x3fd1c73500000000 */
+ 2.81392335891723632812e-01, /* 0x3fd2025500000000 */
+ 2.84993648529052734375e-01, /* 0x3fd23d5600000000 */
+ 2.88587331771850585938e-01, /* 0x3fd2783700000000 */
+ 2.92173147201538085938e-01, /* 0x3fd2b2f700000000 */
+ 2.95751571655273437500e-01, /* 0x3fd2ed9800000000 */
+ 2.99322128295898437500e-01, /* 0x3fd3281800000000 */
+ 3.02884817123413085938e-01, /* 0x3fd3627700000000 */
+ 3.06439399719238281250e-01, /* 0x3fd39cb400000000 */
+ 3.09986352920532226562e-01, /* 0x3fd3d6d100000000 */
+ 3.13524961471557617188e-01, /* 0x3fd410cb00000000 */
+ 3.17055702209472656250e-01, /* 0x3fd44aa400000000 */
+ 3.20578098297119140625e-01, /* 0x3fd4845a00000000 */
+ 3.24092388153076171875e-01, /* 0x3fd4bdee00000000 */
+ 3.27598333358764648438e-01, /* 0x3fd4f75f00000000 */
+ 3.31095933914184570312e-01, /* 0x3fd530ad00000000 */
+ 3.34585189819335937500e-01, /* 0x3fd569d800000000 */
+ 3.38066101074218750000e-01, /* 0x3fd5a2e000000000 */
+ 3.41538190841674804688e-01, /* 0x3fd5dbc300000000 */
+ 3.45002174377441406250e-01, /* 0x3fd6148400000000 */
+ 3.48457098007202148438e-01, /* 0x3fd64d1f00000000 */
+ 3.51903676986694335938e-01, /* 0x3fd6859700000000 */
+ 3.55341434478759765625e-01, /* 0x3fd6bdea00000000 */
+ 3.58770608901977539062e-01, /* 0x3fd6f61900000000 */
+ 3.62190723419189453125e-01, /* 0x3fd72e2200000000 */
+ 3.65602254867553710938e-01, /* 0x3fd7660700000000 */
+ 3.69004726409912109375e-01, /* 0x3fd79dc600000000 */
+ 3.72398376464843750000e-01, /* 0x3fd7d56000000000 */
+ 3.75782966613769531250e-01, /* 0x3fd80cd400000000 */
+ 3.79158496856689453125e-01, /* 0x3fd8442200000000 */
+ 3.82525205612182617188e-01, /* 0x3fd87b4b00000000 */
+ 3.85882616043090820312e-01, /* 0x3fd8b24d00000000 */
+ 3.89230966567993164062e-01, /* 0x3fd8e92900000000 */
+ 3.92570018768310546875e-01, /* 0x3fd91fde00000000 */
+ 3.95900011062622070312e-01, /* 0x3fd9566d00000000 */
+ 3.99220705032348632812e-01, /* 0x3fd98cd500000000 */
+ 4.02532100677490234375e-01, /* 0x3fd9c31600000000 */
+ 4.05834197998046875000e-01, /* 0x3fd9f93000000000 */
+ 4.09126996994018554688e-01, /* 0x3fda2f2300000000 */
+ 4.12410259246826171875e-01, /* 0x3fda64ee00000000 */
+ 4.15684223175048828125e-01, /* 0x3fda9a9200000000 */
+ 4.18948888778686523438e-01, /* 0x3fdad00f00000000 */
+ 4.22204017639160156250e-01, /* 0x3fdb056400000000 */
+ 4.25449609756469726562e-01, /* 0x3fdb3a9100000000 */
+ 4.28685665130615234375e-01, /* 0x3fdb6f9600000000 */
+ 4.31912183761596679688e-01, /* 0x3fdba47300000000 */
+ 4.35129165649414062500e-01, /* 0x3fdbd92800000000 */
+ 4.38336372375488281250e-01, /* 0x3fdc0db400000000 */
+ 4.41534280776977539062e-01, /* 0x3fdc421900000000 */
+ 4.44722414016723632812e-01, /* 0x3fdc765500000000 */
+ 4.47900772094726562500e-01, /* 0x3fdcaa6800000000 */
+ 4.51069593429565429688e-01, /* 0x3fdcde5300000000 */
+ 4.54228639602661132812e-01, /* 0x3fdd121500000000 */
+ 4.57377910614013671875e-01, /* 0x3fdd45ae00000000 */
+ 4.60517644882202148438e-01, /* 0x3fdd791f00000000 */
+ 4.63647603988647460938e-01, /* 0x3fddac6700000000 */
+ 4.66767549514770507812e-01, /* 0x3fdddf8500000000 */
+ 4.69877958297729492188e-01, /* 0x3fde127b00000000 */
+ 4.72978591918945312500e-01, /* 0x3fde454800000000 */
+ 4.76069211959838867188e-01, /* 0x3fde77eb00000000 */
+ 4.79150056838989257812e-01, /* 0x3fdeaa6500000000 */
+ 4.82221126556396484375e-01, /* 0x3fdedcb600000000 */
+ 4.85282421112060546875e-01, /* 0x3fdf0ede00000000 */
+ 4.88333940505981445312e-01, /* 0x3fdf40dd00000000 */
+ 4.91375446319580078125e-01, /* 0x3fdf72b200000000 */
+ 4.94406938552856445312e-01, /* 0x3fdfa45d00000000 */
+ 4.97428894042968750000e-01, /* 0x3fdfd5e000000000 */
+ 5.00440597534179687500e-01, /* 0x3fe0039c00000000 */
+ 5.03442764282226562500e-01, /* 0x3fe01c3400000000 */
+ 5.06434917449951171875e-01, /* 0x3fe034b700000000 */
+ 5.09417057037353515625e-01, /* 0x3fe04d2500000000 */
+ 5.12389183044433593750e-01, /* 0x3fe0657e00000000 */
+ 5.15351772308349609375e-01, /* 0x3fe07dc300000000 */
+ 5.18304347991943359375e-01, /* 0x3fe095f300000000 */
+ 5.21246910095214843750e-01, /* 0x3fe0ae0e00000000 */
+ 5.24179458618164062500e-01, /* 0x3fe0c61400000000 */
+ 5.27101993560791015625e-01, /* 0x3fe0de0500000000 */
+ 5.30014991760253906250e-01, /* 0x3fe0f5e200000000 */
+ 5.32917976379394531250e-01, /* 0x3fe10daa00000000 */
+ 5.35810947418212890625e-01, /* 0x3fe1255d00000000 */
+ 5.38693904876708984375e-01, /* 0x3fe13cfb00000000 */
+ 5.41567325592041015625e-01, /* 0x3fe1548500000000 */
+ 5.44430732727050781250e-01, /* 0x3fe16bfa00000000 */
+ 5.47284126281738281250e-01, /* 0x3fe1835a00000000 */
+ 5.50127506256103515625e-01, /* 0x3fe19aa500000000 */
+ 5.52961349487304687500e-01, /* 0x3fe1b1dc00000000 */
+ 5.55785179138183593750e-01, /* 0x3fe1c8fe00000000 */
+ 5.58598995208740234375e-01, /* 0x3fe1e00b00000000 */
+ 5.61403274536132812500e-01, /* 0x3fe1f70400000000 */
+ 5.64197540283203125000e-01, /* 0x3fe20de800000000 */
+ 5.66981792449951171875e-01, /* 0x3fe224b700000000 */
+ 5.69756031036376953125e-01, /* 0x3fe23b7100000000 */
+ 5.72520732879638671875e-01, /* 0x3fe2521700000000 */
+ 5.75275897979736328125e-01, /* 0x3fe268a900000000 */
+ 5.78021049499511718750e-01, /* 0x3fe27f2600000000 */
+ 5.80756187438964843750e-01, /* 0x3fe2958e00000000 */
+ 5.83481788635253906250e-01, /* 0x3fe2abe200000000 */
+ 5.86197376251220703125e-01, /* 0x3fe2c22100000000 */
+ 5.88903427124023437500e-01, /* 0x3fe2d84c00000000 */
+ 5.91599464416503906250e-01, /* 0x3fe2ee6200000000 */
+ 5.94285964965820312500e-01, /* 0x3fe3046400000000 */
+ 5.96962928771972656250e-01, /* 0x3fe31a5200000000 */
+ 5.99629878997802734375e-01, /* 0x3fe3302b00000000 */
+ 6.02287292480468750000e-01, /* 0x3fe345f000000000 */
+ 6.04934692382812500000e-01, /* 0x3fe35ba000000000 */
+ 6.07573032379150390625e-01, /* 0x3fe3713d00000000 */
+ 6.10201358795166015625e-01, /* 0x3fe386c500000000 */
+ 6.12820148468017578125e-01, /* 0x3fe39c3900000000 */
+ 6.15428924560546875000e-01, /* 0x3fe3b19800000000 */
+ 6.18028640747070312500e-01, /* 0x3fe3c6e400000000 */
+ 6.20618820190429687500e-01, /* 0x3fe3dc1c00000000 */
+ 6.23198986053466796875e-01, /* 0x3fe3f13f00000000 */
+ 6.25770092010498046875e-01, /* 0x3fe4064f00000000 */
+ 6.28331184387207031250e-01, /* 0x3fe41b4a00000000 */
+ 6.30883216857910156250e-01, /* 0x3fe4303200000000 */
+ 6.33425712585449218750e-01, /* 0x3fe4450600000000 */
+ 6.35958671569824218750e-01, /* 0x3fe459c600000000 */
+ 6.38482093811035156250e-01, /* 0x3fe46e7200000000 */
+ 6.40995979309082031250e-01, /* 0x3fe4830a00000000 */
+ 6.43500804901123046875e-01, /* 0x3fe4978f00000000 */
+ 6.45996093750000000000e-01, /* 0x3fe4ac0000000000 */
+ 6.48482322692871093750e-01, /* 0x3fe4c05e00000000 */
+ 6.50959014892578125000e-01, /* 0x3fe4d4a800000000 */
+ 6.53426170349121093750e-01, /* 0x3fe4e8de00000000 */
+ 6.55884265899658203125e-01, /* 0x3fe4fd0100000000 */
+ 6.58332824707031250000e-01, /* 0x3fe5111000000000 */
+ 6.60772323608398437500e-01, /* 0x3fe5250c00000000 */
+ 6.63202762603759765625e-01, /* 0x3fe538f500000000 */
+ 6.65623664855957031250e-01, /* 0x3fe54cca00000000 */
+ 6.68035984039306640625e-01, /* 0x3fe5608d00000000 */
+ 6.70438766479492187500e-01, /* 0x3fe5743c00000000 */
+ 6.72832489013671875000e-01, /* 0x3fe587d800000000 */
+ 6.75216674804687500000e-01, /* 0x3fe59b6000000000 */
+ 6.77592277526855468750e-01, /* 0x3fe5aed600000000 */
+ 6.79958820343017578125e-01, /* 0x3fe5c23900000000 */
+ 6.82316303253173828125e-01, /* 0x3fe5d58900000000 */
+ 6.84664726257324218750e-01, /* 0x3fe5e8c600000000 */
+ 6.87004089355468750000e-01, /* 0x3fe5fbf000000000 */
+ 6.89334869384765625000e-01, /* 0x3fe60f0800000000 */
+ 6.91656589508056640625e-01, /* 0x3fe6220d00000000 */
+ 6.93969249725341796875e-01, /* 0x3fe634ff00000000 */
+ 6.96272850036621093750e-01, /* 0x3fe647de00000000 */
+ 6.98567867279052734375e-01, /* 0x3fe65aab00000000 */
+ 7.00854301452636718750e-01, /* 0x3fe66d6600000000 */
+ 7.03131675720214843750e-01, /* 0x3fe6800e00000000 */
+ 7.05400466918945312500e-01, /* 0x3fe692a400000000 */
+ 7.07660198211669921875e-01, /* 0x3fe6a52700000000 */
+ 7.09911346435546875000e-01, /* 0x3fe6b79800000000 */
+ 7.12153911590576171875e-01, /* 0x3fe6c9f700000000 */
+ 7.14387893676757812500e-01, /* 0x3fe6dc4400000000 */
+ 7.16613292694091796875e-01, /* 0x3fe6ee7f00000000 */
+ 7.18829631805419921875e-01, /* 0x3fe700a700000000 */
+ 7.21037864685058593750e-01, /* 0x3fe712be00000000 */
+ 7.23237514495849609375e-01, /* 0x3fe724c300000000 */
+ 7.25428581237792968750e-01, /* 0x3fe736b600000000 */
+ 7.27611064910888671875e-01, /* 0x3fe7489700000000 */
+ 7.29785442352294921875e-01, /* 0x3fe75a6700000000 */
+ 7.31950759887695312500e-01, /* 0x3fe76c2400000000 */
+ 7.34108448028564453125e-01, /* 0x3fe77dd100000000 */
+ 7.36257076263427734375e-01, /* 0x3fe78f6b00000000 */
+ 7.38397598266601562500e-01, /* 0x3fe7a0f400000000 */
+ 7.40530014038085937500e-01, /* 0x3fe7b26c00000000 */
+ 7.42654323577880859375e-01, /* 0x3fe7c3d300000000 */
+ 7.44770050048828125000e-01, /* 0x3fe7d52800000000 */
+ 7.46877670288085937500e-01, /* 0x3fe7e66c00000000 */
+ 7.48976707458496093750e-01, /* 0x3fe7f79e00000000 */
+ 7.51068115234375000000e-01, /* 0x3fe808c000000000 */
+ 7.53150939941406250000e-01, /* 0x3fe819d000000000 */
+ 7.55226135253906250000e-01, /* 0x3fe82ad000000000 */
+ 7.57292747497558593750e-01, /* 0x3fe83bbe00000000 */
+ 7.59351730346679687500e-01, /* 0x3fe84c9c00000000 */
+ 7.61402606964111328125e-01, /* 0x3fe85d6900000000 */
+ 7.63445377349853515625e-01, /* 0x3fe86e2500000000 */
+ 7.65480041503906250000e-01, /* 0x3fe87ed000000000 */
+ 7.67507076263427734375e-01, /* 0x3fe88f6b00000000 */
+ 7.69526004791259765625e-01, /* 0x3fe89ff500000000 */
+ 7.71537303924560546875e-01, /* 0x3fe8b06f00000000 */
+ 7.73540973663330078125e-01, /* 0x3fe8c0d900000000 */
+ 7.75536537170410156250e-01, /* 0x3fe8d13200000000 */
+ 7.77523994445800781250e-01, /* 0x3fe8e17a00000000 */
+ 7.79504299163818359375e-01, /* 0x3fe8f1b300000000 */
+ 7.81476497650146484375e-01, /* 0x3fe901db00000000 */
+ 7.83441066741943359375e-01, /* 0x3fe911f300000000 */
+ 7.85398006439208984375e-01}; /* 0x3fe921fb00000000 */
+
+ static const double atan_jby256_tail[ 241] = {
+ 2.13244638182005395671e-08, /* 0x3e56e59fbd38db2c */
+ 3.89093864761712760656e-08, /* 0x3e64e3aa54dedf96 */
+ 4.44780900009437454576e-08, /* 0x3e67e105ab1bda88 */
+ 1.15344768460112754160e-08, /* 0x3e48c5254d013fd0 */
+ 3.37271051945395312705e-09, /* 0x3e2cf8ab3ad62670 */
+ 2.40857608736109859459e-08, /* 0x3e59dca4bec80468 */
+ 1.85853810450623807768e-08, /* 0x3e53f4b5ec98a8da */
+ 5.14358299969225078306e-08, /* 0x3e6b9d49619d81fe */
+ 8.85023985412952486748e-09, /* 0x3e43017887460934 */
+ 1.59425154214358432060e-08, /* 0x3e511e3eca0b9944 */
+ 1.95139937737755753164e-08, /* 0x3e54f3f73c5a332e */
+ 2.64909755273544319715e-08, /* 0x3e5c71c8ae0e00a6 */
+ 4.43388037881231070144e-08, /* 0x3e67cde0f86fbdc7 */
+ 2.14757072421821274557e-08, /* 0x3e570f328c889c72 */
+ 2.61049792670754218852e-08, /* 0x3e5c07ae9b994efe */
+ 7.81439350674466302231e-09, /* 0x3e40c8021d7b1698 */
+ 3.60125207123751024094e-08, /* 0x3e635585edb8cb22 */
+ 6.15276238179343767917e-08, /* 0x3e70842567b30e96 */
+ 9.54387964641184285058e-08, /* 0x3e799e811031472e */
+ 3.02789566851502754129e-08, /* 0x3e6041821416bcee */
+ 1.16888650949870856331e-07, /* 0x3e7f6086e4dc96f4 */
+ 1.07580956468653338863e-08, /* 0x3e471a535c5f1b58 */
+ 8.33454265379535427653e-08, /* 0x3e765f743fe63ca1 */
+ 1.10790279272629526068e-07, /* 0x3e7dbd733472d014 */
+ 1.08394277896366207424e-07, /* 0x3e7d18cc4d8b0d1d */
+ 9.22176086126841098800e-08, /* 0x3e78c12553c8fb29 */
+ 7.90938592199048786990e-08, /* 0x3e753b49e2e8f991 */
+ 8.66445407164293125637e-08, /* 0x3e77422ae148c141 */
+ 1.40839973537092438671e-08, /* 0x3e4e3ec269df56a8 */
+ 1.19070438507307600689e-07, /* 0x3e7ff6754e7e0ac9 */
+ 6.40451663051716197071e-08, /* 0x3e7131267b1b5aad */
+ 1.08338682076343674522e-07, /* 0x3e7d14fa403a94bc */
+ 3.52999550187922736222e-08, /* 0x3e62f396c089a3d8 */
+ 1.05983273930043077202e-07, /* 0x3e7c731d78fa95bb */
+ 1.05486124078259553339e-07, /* 0x3e7c50f385177399 */
+ 5.82167732281776477773e-08, /* 0x3e6f41409c6f2c20 */
+ 1.08696483983403942633e-07, /* 0x3e7d2d90c4c39ec0 */
+ 4.47335086122377542835e-08, /* 0x3e680420696f2106 */
+ 1.26896287162615723528e-08, /* 0x3e4b40327943a2e8 */
+ 4.06534471589151404531e-08, /* 0x3e65d35e02f3d2a2 */
+ 3.84504846300557026690e-08, /* 0x3e64a498288117b0 */
+ 3.60715006404807269080e-08, /* 0x3e635da119afb324 */
+ 6.44725903165522722801e-08, /* 0x3e714e85cdb9a908 */
+ 3.63749249976409461305e-08, /* 0x3e638754e5547b9a */
+ 1.03901294413833913794e-07, /* 0x3e7be40ae6ce3246 */
+ 6.25379756302167880580e-08, /* 0x3e70c993b3bea7e7 */
+ 6.63984302368488828029e-08, /* 0x3e71d2dd89ac3359 */
+ 3.21844598971548278059e-08, /* 0x3e61476603332c46 */
+ 1.16030611712765830905e-07, /* 0x3e7f25901bac55b7 */
+ 1.17464622142347730134e-07, /* 0x3e7f881b7c826e28 */
+ 7.54604017965808996596e-08, /* 0x3e7441996d698d20 */
+ 1.49234929356206556899e-07, /* 0x3e8407ac521ea089 */
+ 1.41416924523217430259e-07, /* 0x3e82fb0c6c4b1723 */
+ 2.13308065617483489011e-07, /* 0x3e8ca135966a3e18 */
+ 5.04230937933302320146e-08, /* 0x3e6b1218e4d646e4 */
+ 5.45874922281655519035e-08, /* 0x3e6d4e72a350d288 */
+ 1.51849028914786868886e-07, /* 0x3e84617e2f04c329 */
+ 3.09004308703769273010e-08, /* 0x3e6096ec41e82650 */
+ 9.67574548184738317664e-08, /* 0x3e79f91f25773e6e */
+ 4.02508285529322212824e-08, /* 0x3e659c0820f1d674 */
+ 3.01222268096861091157e-08, /* 0x3e602bf7a2df1064 */
+ 2.36189860670079288680e-07, /* 0x3e8fb36bfc40508f */
+ 1.14095158111080887695e-07, /* 0x3e7ea08f3f8dc892 */
+ 7.42349089746573467487e-08, /* 0x3e73ed6254656a0e */
+ 5.12515583196230380184e-08, /* 0x3e6b83f5e5e69c58 */
+ 2.19290391828763918102e-07, /* 0x3e8d6ec2af768592 */
+ 3.83263512187553886471e-08, /* 0x3e6493889a226f94 */
+ 1.61513486284090523855e-07, /* 0x3e85ad8fa65279ba */
+ 5.09996743535589922261e-08, /* 0x3e6b615784d45434 */
+ 1.23694037861246766534e-07, /* 0x3e809a184368f145 */
+ 8.23367955351123783984e-08, /* 0x3e761a2439b0d91c */
+ 1.07591766213053694014e-07, /* 0x3e7ce1a65e39a978 */
+ 1.42789947524631815640e-07, /* 0x3e832a39a93b6a66 */
+ 1.32347123024711878538e-07, /* 0x3e81c3699af804e7 */
+ 2.17626067316598149229e-08, /* 0x3e575e0f4e44ede8 */
+ 2.34454866923044288656e-07, /* 0x3e8f77ced1a7a83b */
+ 2.82966370261766916053e-09, /* 0x3e284e7f0cb1b500 */
+ 2.29300919890907632975e-07, /* 0x3e8ec6b838b02dfe */
+ 1.48428270450261284915e-07, /* 0x3e83ebf4dfbeda87 */
+ 1.87937408574313982512e-07, /* 0x3e89397aed9cb475 */
+ 6.13685946813334055347e-08, /* 0x3e707937bc239c54 */
+ 1.98585022733583817493e-07, /* 0x3e8aa754553131b6 */
+ 7.68394131623752961662e-08, /* 0x3e74a05d407c45dc */
+ 1.28119052312436745644e-07, /* 0x3e8132231a206dd0 */
+ 7.02119104719236502733e-08, /* 0x3e72d8ecfdd69c88 */
+ 9.87954793820636301943e-08, /* 0x3e7a852c74218606 */
+ 1.72176752381034986217e-07, /* 0x3e871bf2baeebb50 */
+ 1.12877225146169704119e-08, /* 0x3e483d7db7491820 */
+ 5.33549829555851737993e-08, /* 0x3e6ca50d92b6da14 */
+ 2.13833275710816521345e-08, /* 0x3e56f5cde8530298 */
+ 1.16243518048290556393e-07, /* 0x3e7f343198910740 */
+ 6.29926408369055877943e-08, /* 0x3e70e8d241ccd80a */
+ 6.45429039328021963791e-08, /* 0x3e71535ac619e6c8 */
+ 8.64001922814281933403e-08, /* 0x3e77316041c36cd2 */
+ 9.50767572202325800240e-08, /* 0x3e7985a000637d8e */
+ 5.80851497508121135975e-08, /* 0x3e6f2f29858c0a68 */
+ 1.82350561135024766232e-07, /* 0x3e8879847f96d909 */
+ 1.98948680587390608655e-07, /* 0x3e8ab3d319e12e42 */
+ 7.83548663450197659846e-08, /* 0x3e75088162dfc4c2 */
+ 3.04374234486798594427e-08, /* 0x3e605749a1cd9d8c */
+ 2.76135725629797411787e-08, /* 0x3e5da65c6c6b8618 */
+ 4.32610105454203065470e-08, /* 0x3e6739bf7df1ad64 */
+ 5.17107515324127256994e-08, /* 0x3e6bc31252aa3340 */
+ 2.82398327875841444660e-08, /* 0x3e5e528191ad3aa8 */
+ 1.87482469524195595399e-07, /* 0x3e8929d93df19f18 */
+ 2.97481891662714096139e-08, /* 0x3e5ff11eb693a080 */
+ 9.94421570843584316402e-09, /* 0x3e455ae3f145a3a0 */
+ 1.07056210730391848428e-07, /* 0x3e7cbcd8c6c0ca82 */
+ 6.25589580466881163081e-08, /* 0x3e70cb04d425d304 */
+ 9.56641013869464593803e-08, /* 0x3e79adfcab5be678 */
+ 1.88056307148355440276e-07, /* 0x3e893d90c5662508 */
+ 8.38850689379557880950e-08, /* 0x3e768489bd35ff40 */
+ 5.01215865527674122924e-09, /* 0x3e3586ed3da2b7e0 */
+ 1.74166095998522089762e-07, /* 0x3e87604d2e850eee */
+ 9.96779574395363585849e-08, /* 0x3e7ac1d12bfb53d8 */
+ 5.98432026368321460686e-09, /* 0x3e39b3d468274740 */
+ 1.18362922366887577169e-07, /* 0x3e7fc5d68d10e53c */
+ 1.86086833284154215946e-07, /* 0x3e88f9e51884becb */
+ 1.97671457251348941011e-07, /* 0x3e8a87f0869c06d1 */
+ 1.42447160717199237159e-07, /* 0x3e831e7279f685fa */
+ 1.05504240785546574184e-08, /* 0x3e46a8282f9719b0 */
+ 3.13335218371639189324e-08, /* 0x3e60d2724a8a44e0 */
+ 1.96518418901914535399e-07, /* 0x3e8a60524b11ad4e */
+ 2.17692035039173536059e-08, /* 0x3e575fdf832750f0 */
+ 2.15613114426529981675e-07, /* 0x3e8cf06902e4cd36 */
+ 5.68271098300441214948e-08, /* 0x3e6e82422d4f6d10 */
+ 1.70331455823369124256e-08, /* 0x3e524a091063e6c0 */
+ 9.17590028095709583247e-08, /* 0x3e78a1a172dc6f38 */
+ 2.77266304112916566247e-07, /* 0x3e929b6619f8a92d */
+ 9.37041937614656939690e-08, /* 0x3e79274d9c1b70c8 */
+ 1.56116346368316796511e-08, /* 0x3e50c34b1fbb7930 */
+ 4.13967433808382727413e-08, /* 0x3e6639866c20eb50 */
+ 1.70164749185821616276e-07, /* 0x3e86d6d0f6832e9e */
+ 4.01708788545600086008e-07, /* 0x3e9af54def99f25e */
+ 2.59663539226050551563e-07, /* 0x3e916cfc52a00262 */
+ 2.22007487655027469542e-07, /* 0x3e8dcc1e83569c32 */
+ 2.90542250809644081369e-07, /* 0x3e937f7a551ed425 */
+ 4.67720537666628903341e-07, /* 0x3e9f6360adc98887 */
+ 2.79799803956772554802e-07, /* 0x3e92c6ec8d35a2c1 */
+ 2.07344552327432547723e-07, /* 0x3e8bd44df84cb036 */
+ 2.54705698692735196368e-07, /* 0x3e9117cf826e310e */
+ 4.26848589539548450728e-07, /* 0x3e9ca533f332cfc9 */
+ 2.52506723633552216197e-07, /* 0x3e90f208509dbc2e */
+ 2.14684129933849704964e-07, /* 0x3e8cd07d93c945de */
+ 3.20134822201596505431e-07, /* 0x3e957bdfd67e6d72 */
+ 9.93537565749855712134e-08, /* 0x3e7aab89c516c658 */
+ 3.70792944827917252327e-08, /* 0x3e63e823b1a1b8a0 */
+ 1.41772749369083698972e-07, /* 0x3e8307464a9d6d3c */
+ 4.22446601490198804306e-07, /* 0x3e9c5993cd438843 */
+ 4.11818433724801511540e-07, /* 0x3e9ba2fca02ab554 */
+ 1.19976381502605310519e-07, /* 0x3e801a5b6983a268 */
+ 3.43703078571520905265e-08, /* 0x3e6273d1b350efc8 */
+ 1.66128705555453270379e-07, /* 0x3e864c238c37b0c6 */
+ 5.00499610023283006540e-08, /* 0x3e6aded07370a300 */
+ 1.75105139941208062123e-07, /* 0x3e878091197eb47e */
+ 7.70807146729030327334e-08, /* 0x3e74b0f245e0dabc */
+ 2.45918607526895836121e-07, /* 0x3e9080d9794e2eaf */
+ 2.18359020958626199345e-07, /* 0x3e8d4ec242b60c76 */
+ 8.44342887976445333569e-09, /* 0x3e4221d2f940caa0 */
+ 1.07506148687888629299e-07, /* 0x3e7cdbc42b2bba5c */
+ 5.36544954316820904572e-08, /* 0x3e6cce37bb440840 */
+ 3.39109101518396596341e-07, /* 0x3e96c1d999cf1dd0 */
+ 2.60098720293920613340e-08, /* 0x3e5bed8a07eb0870 */
+ 8.42678991664621455827e-08, /* 0x3e769ed88f490e3c */
+ 5.36972237470183633197e-08, /* 0x3e6cd41719b73ef0 */
+ 4.28192558171921681288e-07, /* 0x3e9cbc4ac95b41b7 */
+ 2.71535491483955143294e-07, /* 0x3e9238f1b890f5d7 */
+ 7.84094998145075780203e-08, /* 0x3e750c4282259cc4 */
+ 3.43880599134117431863e-07, /* 0x3e9713d2de87b3e2 */
+ 1.32878065060366481043e-07, /* 0x3e81d5a7d2255276 */
+ 4.18046802627967629428e-07, /* 0x3e9c0dfd48227ac1 */
+ 2.65042411765766019424e-07, /* 0x3e91c964dab76753 */
+ 1.70383695347518643694e-07, /* 0x3e86de56d5704496 */
+ 1.54096497259613515678e-07, /* 0x3e84aeb71fd19968 */
+ 2.36543402412459813461e-07, /* 0x3e8fbf91c57b1918 */
+ 4.38416350106876736790e-07, /* 0x3e9d6bef7fbe5d9a */
+ 3.03892161339927775731e-07, /* 0x3e9464d3dc249066 */
+ 3.31136771605664899240e-07, /* 0x3e9638e2ec4d9073 */
+ 6.49494294526590682218e-08, /* 0x3e716f4a7247ea7c */
+ 4.10423429887181345747e-09, /* 0x3e31a0a740f1d440 */
+ 1.70831640869113847224e-07, /* 0x3e86edbb0114a33c */
+ 1.10811512657909180966e-07, /* 0x3e7dbee8bf1d513c */
+ 3.23677724749783611964e-07, /* 0x3e95b8bdb0248f73 */
+ 3.55662734259192678528e-07, /* 0x3e97de3d3f5eac64 */
+ 2.30102333489738219140e-07, /* 0x3e8ee24187ae448a */
+ 4.47429004000738629714e-07, /* 0x3e9e06c591ec5192 */
+ 7.78167135617329598659e-08, /* 0x3e74e3861a332738 */
+ 9.90345291908535415737e-08, /* 0x3e7a9599dcc2bfe4 */
+ 5.85800913143113728314e-08, /* 0x3e6f732fbad43468 */
+ 4.57859062410871843857e-07, /* 0x3e9eb9f573b727d9 */
+ 3.67993069723390929794e-07, /* 0x3e98b212a2eb9897 */
+ 2.90836464322977276043e-07, /* 0x3e9384884c167215 */
+ 2.51621574250131388318e-07, /* 0x3e90e2d363020051 */
+ 2.75789824740652815545e-07, /* 0x3e92820879fbd022 */
+ 3.88985776250314403593e-07, /* 0x3e9a1ab9893e4b30 */
+ 1.40214080183768019611e-07, /* 0x3e82d1b817a24478 */
+ 3.23451432223550478373e-08, /* 0x3e615d7b8ded4878 */
+ 9.15979180730608444470e-08, /* 0x3e78968f9db3a5e4 */
+ 3.44371402498640470421e-07, /* 0x3e971c4171fe135f */
+ 3.40401897215059498077e-07, /* 0x3e96d80f605d0d8c */
+ 1.06431813453707950243e-07, /* 0x3e7c91f043691590 */
+ 1.46204238932338846248e-07, /* 0x3e839f8a15fce2b2 */
+ 9.94610376972039046878e-09, /* 0x3e455beda9d94b80 */
+ 2.01711528092681771039e-07, /* 0x3e8b12c15d60949a */
+ 2.72027977986191568296e-07, /* 0x3e924167b312bfe3 */
+ 2.48402602511693757964e-07, /* 0x3e90ab8633070277 */
+ 1.58480011219249621715e-07, /* 0x3e854554ebbc80ee */
+ 3.00372828113368713281e-08, /* 0x3e60204aef5a4bb8 */
+ 3.67816204583541976394e-07, /* 0x3e98af08c679cf2c */
+ 2.46169793032343824291e-07, /* 0x3e90852a330ae6c8 */
+ 1.70080468270204253247e-07, /* 0x3e86d3eb9ec32916 */
+ 1.67806717763872914315e-07, /* 0x3e8685cb7fcbbafe */
+ 2.67715622006907942620e-07, /* 0x3e91f751c1e0bd95 */
+ 2.14411342550299170574e-08, /* 0x3e5705b1b0f72560 */
+ 4.11228221283669073277e-07, /* 0x3e9b98d8d808ca92 */
+ 3.52311752396749662260e-08, /* 0x3e62ea22c75cc980 */
+ 3.52718000397367821054e-07, /* 0x3e97aba62bca0350 */
+ 4.38857387992911129814e-07, /* 0x3e9d73833442278c */
+ 3.22574606753482540743e-07, /* 0x3e95a5ca1fb18bf9 */
+ 3.28730371182804296828e-08, /* 0x3e61a6092b6ecf28 */
+ 7.56672470607639279700e-08, /* 0x3e744fd049aac104 */
+ 3.26750155316369681821e-09, /* 0x3e2c114fd8df5180 */
+ 3.21724445362095284743e-07, /* 0x3e95972f130feae5 */
+ 1.06639427371776571151e-07, /* 0x3e7ca034a55fe198 */
+ 3.41020788139524715063e-07, /* 0x3e96e2b149990227 */
+ 1.00582838631232552824e-07, /* 0x3e7b00000294592c */
+ 3.68439433859276640065e-07, /* 0x3e98b9bdc442620e */
+ 2.20403078342388012027e-07, /* 0x3e8d94fdfabf3e4e */
+ 1.62841467098298142534e-07, /* 0x3e85db30b145ad9a */
+ 2.25325348296680733838e-07, /* 0x3e8e3e1eb95022b0 */
+ 4.37462238226421614339e-07, /* 0x3e9d5b8b45442bd6 */
+ 3.52055880555040706500e-07, /* 0x3e97a046231ecd2e */
+ 4.75614398494781776825e-07, /* 0x3e9feafe3ef55232 */
+ 3.60998399033215317516e-07, /* 0x3e9839e7bfd78267 */
+ 3.79292434611513945954e-08, /* 0x3e645cf49d6fa900 */
+ 1.29859015528549300061e-08, /* 0x3e4be3132b27f380 */
+ 3.15927546985474913188e-07, /* 0x3e9533980bb84f9f */
+ 2.28533679887379668031e-08, /* 0x3e5889e2ce3ba390 */
+ 1.17222541823553133877e-07, /* 0x3e7f7778c3ad0cc8 */
+ 1.51991208405464415857e-07, /* 0x3e846660cec4eba2 */
+ 1.56958239325240655564e-07}; /* 0x3e85110b4611a626 */
+
+ /* Some constants and split constants. */
+
+ static double pi = 3.1415926535897932e+00, /* 0x400921fb54442d18 */
+ piby2 = 1.5707963267948966e+00, /* 0x3ff921fb54442d18 */
+ piby4 = 7.8539816339744831e-01, /* 0x3fe921fb54442d18 */
+ three_piby4 = 2.3561944901923449e+00, /* 0x4002d97c7f3321d2 */
+ pi_head = 3.1415926218032836e+00, /* 0x400921fb50000000 */
+ pi_tail = 3.1786509547056392e-08, /* 0x3e6110b4611a6263 */
+ piby2_head = 1.5707963267948965e+00, /* 0x3ff921fb54442d18 */
+ piby2_tail = 6.1232339957367660e-17; /* 0x3c91a62633145c07 */
+
+ double u, v, vbyu, q1, q2, s, u1, vu1, u2, vu2, uu, c, r;
+ unsigned int swap_vu, index, xzero, yzero, xnan, ynan, xinf, yinf;
+ int m, xexp, yexp, diffexp;
+
+ /* Find properties of arguments x and y. */
+
+ unsigned long long ux, ui, aux, xneg, uy, auy, yneg;
+
+ GET_BITS_DP64(x, ux);
+ GET_BITS_DP64(y, uy);
+ aux = ux & ~SIGNBIT_DP64;
+ auy = uy & ~SIGNBIT_DP64;
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ yexp = (int)((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ xneg = ux & SIGNBIT_DP64;
+ yneg = uy & SIGNBIT_DP64;
+ xzero = (aux == 0);
+ yzero = (auy == 0);
+ xnan = (aux > PINFBITPATT_DP64);
+ ynan = (auy > PINFBITPATT_DP64);
+ xinf = (aux == PINFBITPATT_DP64);
+ yinf = (auy == PINFBITPATT_DP64);
+
+ diffexp = yexp - xexp;
+
+ /* Special cases */
+
+ if (xnan)
+#ifdef WINDOWS
+ return handle_error("atan2", ux|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, y);
+#else
+ return x + x; /* Raise invalid if it's a signalling NaN */
+#endif
+ else if (ynan)
+#ifdef WINDOWS
+ return handle_error("atan2", uy|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, y);
+#else
+ return y + y; /* Raise invalid if it's a signalling NaN */
+#endif
+ else if (yzero)
+ { /* Zero y gives +-0 for positive x
+ and +-pi for negative x */
+#ifndef WINDOWS
+ if ((_LIB_VERSION == _SVID_) && xzero)
+ /* Sigh - _SVID_ defines atan2(0,0) as a domain error */
+ return retval_errno_edom(x, y);
+ else
+#endif
+ if (xneg)
+ {
+ if (yneg) return val_with_flags(-pi,AMD_F_INEXACT);
+ else return val_with_flags(pi,AMD_F_INEXACT);
+ }
+ else return y;
+ }
+ else if (xzero)
+ { /* Zero x gives +- pi/2
+ depending on sign of y */
+ if (yneg) return val_with_flags(-piby2,AMD_F_INEXACT);
+      else return val_with_flags(piby2,AMD_F_INEXACT);
+ }
+
+ /* Scale up both x and y if they are both below 1/4.
+ This avoids any possible later denormalised arithmetic. */
+
+ if ((xexp < 1021 && yexp < 1021))
+ {
+ scaleUpDouble1024(ux, &ux);
+ scaleUpDouble1024(uy, &uy);
+ PUT_BITS_DP64(ux, x);
+ PUT_BITS_DP64(uy, y);
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ yexp = (int)((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ diffexp = yexp - xexp;
+ }
+
+ if (diffexp > 56)
+ { /* abs(y)/abs(x) > 2^56 => arctan(x/y)
+ is insignificant compared to piby2 */
+ if (yneg) return val_with_flags(-piby2,AMD_F_INEXACT);
+ else return val_with_flags(piby2,AMD_F_INEXACT);
+ }
+ else if (diffexp < -28 && (!xneg))
+ { /* x positive and dominant over y by a factor of 2^28.
+ In this case atan(y/x) is y/x to machine accuracy. */
+
+ if (diffexp < -1074) /* Result underflows */
+ {
+ if (yneg)
+ return val_with_flags(-0.0,AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ else
+ return val_with_flags(0.0,AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ }
+ else
+ {
+ if (diffexp < -1022)
+ {
+ /* Result will likely be denormalized */
+ y = scaleDouble_1(y, 100);
+ y /= x;
+ /* Now y is 2^100 times the true result. Scale it back down. */
+ GET_BITS_DP64(y, uy);
+ scaleDownDouble(uy, 100, &uy);
+ PUT_BITS_DP64(uy, y);
+ if ((uy & EXPBITS_DP64) == 0)
+ return val_with_flags(y, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ else
+ return y;
+ }
+ else
+ return y / x;
+ }
+ }
+ else if (diffexp < -56 && xneg)
+ { /* abs(x)/abs(y) > 2^56 and x < 0 => arctan(y/x)
+ is insignificant compared to pi */
+ if (yneg) return val_with_flags(-pi,AMD_F_INEXACT);
+ else return val_with_flags(pi,AMD_F_INEXACT);
+ }
+ else if (yinf && xinf)
+ { /* If abs(x) and abs(y) are both infinity
+ return +-pi/4 or +- 3pi/4 according to
+ signs. */
+ if (xneg)
+ {
+ if (yneg) return val_with_flags(-three_piby4,AMD_F_INEXACT);
+ else return val_with_flags(three_piby4,AMD_F_INEXACT);
+ }
+ else
+ {
+ if (yneg) return val_with_flags(-piby4,AMD_F_INEXACT);
+ else return val_with_flags(piby4,AMD_F_INEXACT);
+ }
+ }
+
+ /* General case: take absolute values of arguments */
+
+ u = x; v = y;
+ if (xneg) u = -x;
+ if (yneg) v = -y;
+
+ /* Swap u and v if necessary to obtain 0 < v < u. Compute v/u. */
+
+ swap_vu = (u < v);
+ if (swap_vu) { uu = u; u = v; v = uu; }
+ vbyu = v/u;
+
+ if (vbyu > 0.0625)
+ { /* General values of v/u. Use a look-up
+ table and series expansion. */
+
+ index = (int)(256*vbyu + 0.5);
+ q1 = atan_jby256_lead[index-16];
+ q2 = atan_jby256_tail[index-16];
+ c = index*1./256;
+ GET_BITS_DP64(u, ui);
+ m = (int)((ui & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ u = scaleDouble_2(u,-m);
+ v = scaleDouble_2(v,-m);
+ GET_BITS_DP64(u, ui);
+ PUT_BITS_DP64(0xfffffffff8000000 & ui, u1); /* 26 leading bits of u */
+ u2 = u - u1;
+
+ r = ((v-c*u1)-c*u2)/(u+c*v);
+
+ /* Polynomial approximation to atan(r) */
+
+ s = r*r;
+ q2 = q2 + r - r*(s * (0.33333333333224095522 - s*(0.19999918038989143496)));
+ }
+ else if (vbyu < 1.e-8)
+ { /* v/u is small enough that atan(v/u) = v/u */
+ q1 = 0.0;
+ q2 = vbyu;
+ }
+ else /* vbyu <= 0.0625 */
+ {
+ /* Small values of v/u. Use a series expansion
+ computed carefully to minimise cancellation */
+
+ GET_BITS_DP64(u, ui);
+ PUT_BITS_DP64(0xffffffff00000000 & ui, u1);
+ GET_BITS_DP64(vbyu, ui);
+ PUT_BITS_DP64(0xffffffff00000000 & ui, vu1);
+ u2 = u - u1;
+ vu2 = vbyu - vu1;
+
+ q1 = 0.0;
+ s = vbyu*vbyu;
+ q2 = vbyu +
+ ((((v - u1*vu1) - u2*vu1) - u*vu2)/u -
+ (vbyu*s*(0.33333333333333170500 -
+ s*(0.19999999999393223405 -
+ s*(0.14285713561807169030 -
+ s*(0.11110736283514525407 -
+ s*(0.90029810285449784439E-01)))))));
+ }
+
+ /* Tidy-up according to which quadrant the arguments lie in */
+
+ if (swap_vu) {q1 = piby2_head - q1; q2 = piby2_tail - q2;}
+ if (xneg) {q1 = pi_head - q1; q2 = pi_tail - q2;}
+ q1 = q1 + q2;
+
+ if (yneg) q1 = - q1;
+
+ return q1;
+}
+
+weak_alias (__atan2, atan2)
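
The table-driven branch above relies on the identity atan(v/u) = atan(c) + atan((v - c*u)/(u + c*v)) with c = index/256; atan(c) is read from the lead/tail tables and only the small residual is handled by a short polynomial in s = r*r. A minimal standalone sketch of that reduction, not part of the patch, using atan/atan2 from <math.h> purely as a reference and an arbitrarily chosen test point:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double u = 1.0, v = 0.3;                  /* 0 < v < u and v/u > 0.0625 */
        int index = (int)(256.0 * (v / u) + 0.5); /* nearest table entry        */
        double c = index / 256.0;
        double r = (v - c * u) / (u + c * v);     /* tangent of the residual    */
        double reduced = atan(c) + atan(r);       /* table value + residual     */
        printf("%.17g %.17g\n", reduced, atan2(v, u));
        return 0;
    }

Because index is the nearest multiple of 1/256 to v/u, the residual r stays below roughly 1/512 in magnitude, which is why two correction terms suffice at double precision.
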
diff --git a/src/atan2f.c b/src/atan2f.c
new file mode 100644
index 0000000..9b53c6f
--- /dev/null
+++ b/src/atan2f.c
@@ -0,0 +1,500 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_VALF_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOWNDOUBLE
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VALF_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOWNDOUBLE
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range arguments
+ (only used when _LIB_VERSION is _SVID_) */
+static inline float retval_errno_edom(float x, float y)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)y;
+ exc.type = DOMAIN;
+ exc.name = (char *)"atan2f";
+ exc.retval = HUGE;
+ if (!matherr(&exc))
+ {
+ (void)fputs("atan2f: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atan2f)
+#endif
+
+float FN_PROTOTYPE(atan2f)(float fy, float fx)
+{
+ /* Array atan_jby256 contains precomputed values of atan(j/256),
+ for j = 16, 17, ..., 256. */
+
+ static const double atan_jby256[ 241] = {
+ 6.24188099959573430842e-02, /* 0x3faff55bb72cfde9 */
+ 6.63088949198234745008e-02, /* 0x3fb0f99ea71d52a6 */
+ 7.01969710718705064423e-02, /* 0x3fb1f86dbf082d58 */
+ 7.40829225490337306415e-02, /* 0x3fb2f719318a4a9a */
+ 7.79666338315423007588e-02, /* 0x3fb3f59f0e7c559d */
+ 8.18479898030765457007e-02, /* 0x3fb4f3fd677292fb */
+ 8.57268757707448092464e-02, /* 0x3fb5f2324fd2d7b2 */
+ 8.96031774848717321724e-02, /* 0x3fb6f03bdcea4b0c */
+ 9.34767811585894559112e-02, /* 0x3fb7ee182602f10e */
+ 9.73475734872236708739e-02, /* 0x3fb8ebc54478fb28 */
+ 1.01215441667466668485e-01, /* 0x3fb9e94153cfdcf1 */
+ 1.05080273416329528224e-01, /* 0x3fbae68a71c722b8 */
+ 1.08941956989865793015e-01, /* 0x3fbbe39ebe6f07c3 */
+ 1.12800381201659388752e-01, /* 0x3fbce07c5c3cca32 */
+ 1.16655435441069349478e-01, /* 0x3fbddd21701eba6e */
+ 1.20507009691224548087e-01, /* 0x3fbed98c2190043a */
+ 1.24354994546761424279e-01, /* 0x3fbfd5ba9aac2f6d */
+ 1.28199281231298117811e-01, /* 0x3fc068d584212b3d */
+ 1.32039761614638734288e-01, /* 0x3fc0e6adccf40881 */
+ 1.35876328229701304195e-01, /* 0x3fc1646541060850 */
+ 1.39708874289163620386e-01, /* 0x3fc1e1fafb043726 */
+ 1.43537293701821222491e-01, /* 0x3fc25f6e171a535c */
+ 1.47361481088651630200e-01, /* 0x3fc2dcbdb2fba1ff */
+ 1.51181331798580037562e-01, /* 0x3fc359e8edeb99a3 */
+ 1.54996741923940972718e-01, /* 0x3fc3d6eee8c6626c */
+ 1.58807608315631065832e-01, /* 0x3fc453cec6092a9e */
+ 1.62613828597948567589e-01, /* 0x3fc4d087a9da4f17 */
+ 1.66415301183114927586e-01, /* 0x3fc54d18ba11570a */
+ 1.70211925285474380276e-01, /* 0x3fc5c9811e3ec269 */
+ 1.74003600935367680469e-01, /* 0x3fc645bfffb3aa73 */
+ 1.77790228992676047071e-01, /* 0x3fc6c1d4898933d8 */
+ 1.81571711160032150945e-01, /* 0x3fc73dbde8a7d201 */
+ 1.85347949995694760705e-01, /* 0x3fc7b97b4bce5b02 */
+ 1.89118848926083965578e-01, /* 0x3fc8350be398ebc7 */
+ 1.92884312257974643856e-01, /* 0x3fc8b06ee2879c28 */
+ 1.96644245190344985064e-01, /* 0x3fc92ba37d050271 */
+ 2.00398553825878511514e-01, /* 0x3fc9a6a8e96c8626 */
+ 2.04147145182116990236e-01, /* 0x3fca217e601081a5 */
+ 2.07889927202262986272e-01, /* 0x3fca9c231b403279 */
+ 2.11626808765629753628e-01, /* 0x3fcb1696574d780b */
+ 2.15357699697738047551e-01, /* 0x3fcb90d7529260a2 */
+ 2.19082510780057748701e-01, /* 0x3fcc0ae54d768466 */
+ 2.22801153759394493514e-01, /* 0x3fcc84bf8a742e6d */
+ 2.26513541356919617664e-01, /* 0x3fccfe654e1d5395 */
+ 2.30219587276843717927e-01, /* 0x3fcd77d5df205736 */
+ 2.33919206214733416127e-01, /* 0x3fcdf110864c9d9d */
+ 2.37612313865471241892e-01, /* 0x3fce6a148e96ec4d */
+ 2.41298826930858800743e-01, /* 0x3fcee2e1451d980c */
+ 2.44978663126864143473e-01, /* 0x3fcf5b75f92c80dd */
+ 2.48651741190513253521e-01, /* 0x3fcfd3d1fc40dbe4 */
+ 2.52317980886427151166e-01, /* 0x3fd025fa510665b5 */
+ 2.55977303013005474952e-01, /* 0x3fd061eea03d6290 */
+ 2.59629629408257511791e-01, /* 0x3fd09dc597d86362 */
+ 2.63274882955282396590e-01, /* 0x3fd0d97ee509acb3 */
+ 2.66912987587400396539e-01, /* 0x3fd1151a362431c9 */
+ 2.70543868292936529052e-01, /* 0x3fd150973a9ce546 */
+ 2.74167451119658789338e-01, /* 0x3fd18bf5a30bf178 */
+ 2.77783663178873208022e-01, /* 0x3fd1c735212dd883 */
+ 2.81392432649178403370e-01, /* 0x3fd2025567e47c95 */
+ 2.84993688779881237938e-01, /* 0x3fd23d562b381041 */
+ 2.88587361894077354396e-01, /* 0x3fd278372057ef45 */
+ 2.92173383391398755471e-01, /* 0x3fd2b2f7fd9b5fe2 */
+ 2.95751685750431536626e-01, /* 0x3fd2ed987a823cfe */
+ 2.99322202530807379706e-01, /* 0x3fd328184fb58951 */
+ 3.02884868374971361060e-01, /* 0x3fd362773707ebcb */
+ 3.06439619009630070945e-01, /* 0x3fd39cb4eb76157b */
+ 3.09986391246883430384e-01, /* 0x3fd3d6d129271134 */
+ 3.13525122985043869228e-01, /* 0x3fd410cbad6c7d32 */
+ 3.17055753209146973237e-01, /* 0x3fd44aa436c2af09 */
+ 3.20578221991156986359e-01, /* 0x3fd4845a84d0c21b */
+ 3.24092470489871664618e-01, /* 0x3fd4bdee586890e6 */
+ 3.27598440950530811477e-01, /* 0x3fd4f75f73869978 */
+ 3.31096076704132047386e-01, /* 0x3fd530ad9951cd49 */
+ 3.34585322166458920545e-01, /* 0x3fd569d88e1b4cd7 */
+ 3.38066122836825466713e-01, /* 0x3fd5a2e0175e0f4e */
+ 3.41538425296541714449e-01, /* 0x3fd5dbc3fbbe768d */
+ 3.45002177207105076295e-01, /* 0x3fd614840309cfe1 */
+ 3.48457327308122011278e-01, /* 0x3fd64d1ff635c1c5 */
+ 3.51903825414964732676e-01, /* 0x3fd685979f5fa6fd */
+ 3.55341622416168290144e-01, /* 0x3fd6bdeac9cbd76c */
+ 3.58770670270572189509e-01, /* 0x3fd6f61941e4def0 */
+ 3.62190922004212156882e-01, /* 0x3fd72e22d53aa2a9 */
+ 3.65602331706966821034e-01, /* 0x3fd7660752817501 */
+ 3.69004854528964421068e-01, /* 0x3fd79dc6899118d1 */
+ 3.72398446676754202311e-01, /* 0x3fd7d5604b63b3f7 */
+ 3.75783065409248884237e-01, /* 0x3fd80cd46a14b1d0 */
+ 3.79158669033441808605e-01, /* 0x3fd84422b8df95d7 */
+ 3.82525216899905096124e-01, /* 0x3fd87b4b0c1ebedb */
+ 3.85882669398073752109e-01, /* 0x3fd8b24d394a1b25 */
+ 3.89230987951320717144e-01, /* 0x3fd8e92916f5cde8 */
+ 3.92570135011828580396e-01, /* 0x3fd91fde7cd0c662 */
+ 3.95900074055262896078e-01, /* 0x3fd9566d43a34907 */
+ 3.99220769575252543149e-01, /* 0x3fd98cd5454d6b18 */
+ 4.02532187077682512832e-01, /* 0x3fd9c3165cc58107 */
+ 4.05834293074804064450e-01, /* 0x3fd9f93066168001 */
+ 4.09127055079168300278e-01, /* 0x3fda2f233e5e530b */
+ 4.12410441597387267265e-01, /* 0x3fda64eec3cc23fc */
+ 4.15684422123729413467e-01, /* 0x3fda9a92d59e98cf */
+ 4.18948967133552840902e-01, /* 0x3fdad00f5422058b */
+ 4.22204048076583571270e-01, /* 0x3fdb056420ae9343 */
+ 4.25449637370042266227e-01, /* 0x3fdb3a911da65c6c */
+ 4.28685708391625730496e-01, /* 0x3fdb6f962e737efb */
+ 4.31912235472348193799e-01, /* 0x3fdba473378624a5 */
+ 4.35129193889246812521e-01, /* 0x3fdbd9281e528191 */
+ 4.38336559857957774877e-01, /* 0x3fdc0db4c94ec9ef */
+ 4.41534310525166673322e-01, /* 0x3fdc42191ff11eb6 */
+ 4.44722423960939305942e-01, /* 0x3fdc76550aad71f8 */
+ 4.47900879150937292206e-01, /* 0x3fdcaa6872f3631b */
+ 4.51069655988523443568e-01, /* 0x3fdcde53432c1350 */
+ 4.54228735266762495559e-01, /* 0x3fdd121566b7f2ad */
+ 4.57378098670320809571e-01, /* 0x3fdd45aec9ec862b */
+ 4.60517728767271039558e-01, /* 0x3fdd791f5a1226f4 */
+ 4.63647609000806093515e-01, /* 0x3fddac670561bb4f */
+ 4.66767723680866497560e-01, /* 0x3fdddf85bb026974 */
+ 4.69878057975686880265e-01, /* 0x3fde127b6b0744af */
+ 4.72978597903265574054e-01, /* 0x3fde4548066cf51a */
+ 4.76069330322761219421e-01, /* 0x3fde77eb7f175a34 */
+ 4.79150242925822533735e-01, /* 0x3fdeaa65c7cf28c4 */
+ 4.82221324227853687105e-01, /* 0x3fdedcb6d43f8434 */
+ 4.85282563559221225002e-01, /* 0x3fdf0ede98f393cf */
+ 4.88333951056405479729e-01, /* 0x3fdf40dd0b541417 */
+ 4.91375477653101910835e-01, /* 0x3fdf72b221a4e495 */
+ 4.94407135071275316562e-01, /* 0x3fdfa45dd3029258 */
+ 4.97428915812172245392e-01, /* 0x3fdfd5e0175fdf83 */
+ 5.00440813147294050189e-01, /* 0x3fe0039c73c1a40b */
+ 5.03442821109336358099e-01, /* 0x3fe01c341e82422d */
+ 5.06434934483096732549e-01, /* 0x3fe034b709250488 */
+ 5.09417148796356245022e-01, /* 0x3fe04d25314342e5 */
+ 5.12389460310737621107e-01, /* 0x3fe0657e94db30cf */
+ 5.15351866012543347040e-01, /* 0x3fe07dc3324e9b38 */
+ 5.18304363603577900044e-01, /* 0x3fe095f30861a58f */
+ 5.21246951491958210312e-01, /* 0x3fe0ae0e1639866c */
+ 5.24179628782913242802e-01, /* 0x3fe0c6145b5b43da */
+ 5.27102395269579471204e-01, /* 0x3fe0de05d7aa6f7c */
+ 5.30015251423793132268e-01, /* 0x3fe0f5e28b67e295 */
+ 5.32918198386882147055e-01, /* 0x3fe10daa77307a0d */
+ 5.35811237960463593311e-01, /* 0x3fe1255d9bfbd2a8 */
+ 5.38694372597246617929e-01, /* 0x3fe13cfbfb1b056e */
+ 5.41567605391844897333e-01, /* 0x3fe1548596376469 */
+ 5.44430940071603086672e-01, /* 0x3fe16bfa6f5137e1 */
+ 5.47284380987436924748e-01, /* 0x3fe1835a88be7c13 */
+ 5.50127933104692989907e-01, /* 0x3fe19aa5e5299f99 */
+ 5.52961601994028217888e-01, /* 0x3fe1b1dc87904284 */
+ 5.55785393822313511514e-01, /* 0x3fe1c8fe7341f64f */
+ 5.58599315343562330405e-01, /* 0x3fe1e00babdefeb3 */
+ 5.61403373889889367732e-01, /* 0x3fe1f7043557138a */
+ 5.64197577362497537656e-01, /* 0x3fe20de813e823b1 */
+ 5.66981934222700489912e-01, /* 0x3fe224b74c1d192a */
+ 5.69756453482978431069e-01, /* 0x3fe23b71e2cc9e6a */
+ 5.72521144698072359525e-01, /* 0x3fe25217dd17e501 */
+ 5.75276017956117824426e-01, /* 0x3fe268a940696da6 */
+ 5.78021083869819540801e-01, /* 0x3fe27f261273d1b3 */
+ 5.80756353567670302596e-01, /* 0x3fe2958e59308e30 */
+ 5.83481838685214859730e-01, /* 0x3fe2abe21aded073 */
+ 5.86197551356360535557e-01, /* 0x3fe2c2215e024465 */
+ 5.88903504204738026395e-01, /* 0x3fe2d84c2961e48b */
+ 5.91599710335111383941e-01, /* 0x3fe2ee628406cbca */
+ 5.94286183324841177367e-01, /* 0x3fe30464753b090a */
+ 5.96962937215401501234e-01, /* 0x3fe31a52048874be */
+ 5.99629986503951384336e-01, /* 0x3fe3302b39b78856 */
+ 6.02287346134964152178e-01, /* 0x3fe345f01cce37bb */
+ 6.04935031491913965951e-01, /* 0x3fe35ba0b60eccce */
+ 6.07573058389022313541e-01, /* 0x3fe3713d0df6c503 */
+ 6.10201443063065118722e-01, /* 0x3fe386c52d3db11e */
+ 6.12820202165241245673e-01, /* 0x3fe39c391cd41719 */
+ 6.15429352753104952356e-01, /* 0x3fe3b198e5e2564a */
+ 6.18028912282561737612e-01, /* 0x3fe3c6e491c78dc4 */
+ 6.20618898599929469384e-01, /* 0x3fe3dc1c2a188504 */
+ 6.23199329934065904268e-01, /* 0x3fe3f13fb89e96f4 */
+ 6.25770224888563042498e-01, /* 0x3fe4064f47569f48 */
+ 6.28331602434009650615e-01, /* 0x3fe41b4ae06fea41 */
+ 6.30883481900321840818e-01, /* 0x3fe430328e4b26d5 */
+ 6.33425882969144482537e-01, /* 0x3fe445065b795b55 */
+ 6.35958825666321447834e-01, /* 0x3fe459c652badc7f */
+ 6.38482330354437466191e-01, /* 0x3fe46e727efe4715 */
+ 6.40996417725432032775e-01, /* 0x3fe4830aeb5f7bfd */
+ 6.43501108793284370968e-01, /* 0x3fe4978fa3269ee1 */
+ 6.45996424886771558604e-01, /* 0x3fe4ac00b1c71762 */
+ 6.48482387642300484032e-01, /* 0x3fe4c05e22de94e4 */
+ 6.50959018996812410762e-01, /* 0x3fe4d4a8023414e8 */
+ 6.53426341180761927063e-01, /* 0x3fe4e8de5bb6ec04 */
+ 6.55884376711170835605e-01, /* 0x3fe4fd013b7dd17e */
+ 6.58333148384755983962e-01, /* 0x3fe51110adc5ed81 */
+ 6.60772679271132590273e-01, /* 0x3fe5250cbef1e9fa */
+ 6.63202992706093175102e-01, /* 0x3fe538f57b89061e */
+ 6.65624112284960989250e-01, /* 0x3fe54ccaf0362c8f */
+ 6.68036061856020157990e-01, /* 0x3fe5608d29c70c34 */
+ 6.70438865514021320458e-01, /* 0x3fe5743c352b33b9 */
+ 6.72832547593763097282e-01, /* 0x3fe587d81f732fba */
+ 6.75217132663749830535e-01, /* 0x3fe59b60f5cfab9d */
+ 6.77592645519925151909e-01, /* 0x3fe5aed6c5909517 */
+ 6.79959111179481823228e-01, /* 0x3fe5c2399c244260 */
+ 6.82316554874748071313e-01, /* 0x3fe5d58987169b18 */
+ 6.84665002047148862907e-01, /* 0x3fe5e8c6941043cf */
+ 6.87004478341244895212e-01, /* 0x3fe5fbf0d0d5cc49 */
+ 6.89335009598845749323e-01, /* 0x3fe60f084b46e05e */
+ 6.91656621853199760075e-01, /* 0x3fe6220d115d7b8d */
+ 6.93969341323259825138e-01, /* 0x3fe634ff312d1f3b */
+ 6.96273194408023488045e-01, /* 0x3fe647deb8e20b8f */
+ 6.98568207680949848637e-01, /* 0x3fe65aabb6c07b02 */
+ 7.00854407884450081312e-01, /* 0x3fe66d663923e086 */
+ 7.03131821924453670469e-01, /* 0x3fe6800e4e7e2857 */
+ 7.05400476865049030906e-01, /* 0x3fe692a40556fb6a */
+ 7.07660399923197958039e-01, /* 0x3fe6a5276c4b0575 */
+ 7.09911618463524796141e-01, /* 0x3fe6b798920b3d98 */
+ 7.12154159993178659249e-01, /* 0x3fe6c9f7855c3198 */
+ 7.14388052156768926793e-01, /* 0x3fe6dc44551553ae */
+ 7.16613322731374569052e-01, /* 0x3fe6ee7f10204aef */
+ 7.18829999621624415873e-01, /* 0x3fe700a7c5784633 */
+ 7.21038110854851588272e-01, /* 0x3fe712be84295198 */
+ 7.23237684576317874097e-01, /* 0x3fe724c35b4fae7b */
+ 7.25428749044510712274e-01, /* 0x3fe736b65a172dff */
+ 7.27611332626510676214e-01, /* 0x3fe748978fba8e0f */
+ 7.29785463793429123314e-01, /* 0x3fe75a670b82d8d8 */
+ 7.31951171115916565668e-01, /* 0x3fe76c24dcc6c6c0 */
+ 7.34108483259739652560e-01, /* 0x3fe77dd112ea22c7 */
+ 7.36257428981428097003e-01, /* 0x3fe78f6bbd5d315e */
+ 7.38398037123989547936e-01, /* 0x3fe7a0f4eb9c19a2 */
+ 7.40530336612692630105e-01, /* 0x3fe7b26cad2e50fd */
+ 7.42654356450917929600e-01, /* 0x3fe7c3d311a6092b */
+ 7.44770125716075148681e-01, /* 0x3fe7d528289fa093 */
+ 7.46877673555587429099e-01, /* 0x3fe7e66c01c114fd */
+ 7.48977029182941400620e-01, /* 0x3fe7f79eacb97898 */
+ 7.51068221873802288613e-01, /* 0x3fe808c03940694a */
+ 7.53151280962194302759e-01, /* 0x3fe819d0b7158a4c */
+ 7.55226235836744863583e-01, /* 0x3fe82ad036000005 */
+ 7.57293115936992444759e-01, /* 0x3fe83bbec5cdee22 */
+ 7.59351950749757920178e-01, /* 0x3fe84c9c7653f7ea */
+ 7.61402769805578416573e-01, /* 0x3fe85d69576cc2c5 */
+ 7.63445602675201784315e-01, /* 0x3fe86e2578f87ae5 */
+ 7.65480478966144461950e-01, /* 0x3fe87ed0eadc5a2a */
+ 7.67507428319308182552e-01, /* 0x3fe88f6bbd023118 */
+ 7.69526480405658186434e-01, /* 0x3fe89ff5ff57f1f7 */
+ 7.71537664922959498526e-01, /* 0x3fe8b06fc1cf3dfe */
+ 7.73541011592573490852e-01, /* 0x3fe8c0d9145cf49d */
+ 7.75536550156311621507e-01, /* 0x3fe8d13206f8c4ca */
+ 7.77524310373347682379e-01, /* 0x3fe8e17aa99cc05d */
+ 7.79504322017186335181e-01, /* 0x3fe8f1b30c44f167 */
+ 7.81476614872688268854e-01, /* 0x3fe901db3eeef187 */
+ 7.83441218733151756304e-01, /* 0x3fe911f35199833b */
+ 7.85398163397448278999e-01}; /* 0x3fe921fb54442d18 */
+
+ /* Some constants. */
+
+ static double pi = 3.1415926535897932e+00, /* 0x400921fb54442d18 */
+ piby2 = 1.5707963267948966e+00, /* 0x3ff921fb54442d18 */
+ piby4 = 7.8539816339744831e-01, /* 0x3fe921fb54442d18 */
+ three_piby4 = 2.3561944901923449e+00; /* 0x4002d97c7f3321d2 */
+
+ double u, v, vbyu, q, s, uu, r;
+ unsigned int swap_vu, index, xzero, yzero, xnan, ynan, xinf, yinf;
+ int xexp, yexp, diffexp;
+
+ double x = fx;
+ double y = fy;
+
+ /* Find properties of arguments x and y. */
+
+ unsigned long long ux, aux, xneg, uy, auy, yneg;
+
+ GET_BITS_DP64(x, ux);
+ GET_BITS_DP64(y, uy);
+ aux = ux & ~SIGNBIT_DP64;
+ auy = uy & ~SIGNBIT_DP64;
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ yexp = (int)((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ xneg = ux & SIGNBIT_DP64;
+ yneg = uy & SIGNBIT_DP64;
+ xzero = (aux == 0);
+ yzero = (auy == 0);
+ xnan = (aux > PINFBITPATT_DP64);
+ ynan = (auy > PINFBITPATT_DP64);
+ xinf = (aux == PINFBITPATT_DP64);
+ yinf = (auy == PINFBITPATT_DP64);
+
+ diffexp = yexp - xexp;
+
+ /* Special cases */
+
+ if (xnan)
+#ifdef WINDOWS
+ {
+ unsigned int ufx;
+ GET_BITS_SP32(fx, ufx);
+ return handle_errorf("atan2f", ufx|0x00400000, _DOMAIN, 0, EDOM, fx, fy);
+ }
+#else
+ return fx + fx; /* Raise invalid if it's a signalling NaN */
+#endif
+ else if (ynan)
+#ifdef WINDOWS
+ {
+ unsigned int ufy;
+ GET_BITS_SP32(fy, ufy);
+ return handle_errorf("atan2f", ufy|0x00400000, _DOMAIN, 0, EDOM, fx, fy);
+ }
+#else
+ return (float)(y + y); /* Raise invalid if it's a signalling NaN */
+#endif
+ else if (yzero)
+ { /* Zero y gives +-0 for positive x
+ and +-pi for negative x */
+#ifndef WINDOWS
+ if ((_LIB_VERSION == _SVID_) && xzero)
+ /* Sigh - _SVID_ defines atan2(0,0) as a domain error */
+ return retval_errno_edom(x, y);
+ else
+#endif
+ if (xneg)
+ {
+ if (yneg) return valf_with_flags((float)-pi, AMD_F_INEXACT);
+ else return valf_with_flags((float)pi, AMD_F_INEXACT);
+ }
+ else return (float)y;
+ }
+ else if (xzero)
+ { /* Zero x gives +- pi/2
+ depending on sign of y */
+ if (yneg) return valf_with_flags((float)-piby2, AMD_F_INEXACT);
+      else return valf_with_flags((float)piby2, AMD_F_INEXACT);
+ }
+
+ if (diffexp > 26)
+ { /* abs(y)/abs(x) > 2^26 => arctan(x/y)
+ is insignificant compared to piby2 */
+ if (yneg) return valf_with_flags((float)-piby2, AMD_F_INEXACT);
+ else return valf_with_flags((float)piby2, AMD_F_INEXACT);
+ }
+ else if (diffexp < -13 && (!xneg))
+ { /* x positive and dominant over y by a factor of 2^13.
+ In this case atan(y/x) is y/x to machine accuracy. */
+
+ if (diffexp < -150) /* Result underflows */
+ {
+ if (yneg)
+ return valf_with_flags(-0.0F, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ else
+ return valf_with_flags(0.0F, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ }
+ else
+ {
+ if (diffexp < -126)
+ {
+ /* Result will likely be denormalized */
+ y = scaleDouble_1(y, 100);
+ y /= x;
+ /* Now y is 2^100 times the true result. Scale it back down. */
+ GET_BITS_DP64(y, uy);
+ scaleDownDouble(uy, 100, &uy);
+ PUT_BITS_DP64(uy, y);
+ if ((uy & EXPBITS_DP64) == 0)
+ return valf_with_flags((float)y, AMD_F_INEXACT | AMD_F_UNDERFLOW);
+ else
+ return (float)y;
+ }
+ else
+ return (float)(y / x);
+ }
+ }
+ else if (diffexp < -26 && xneg)
+    { /* abs(x)/abs(y) > 2^26 and x < 0 => arctan(y/x)
+                 is insignificant compared to pi */
+ if (yneg) return valf_with_flags((float)-pi, AMD_F_INEXACT);
+ else return valf_with_flags((float)pi, AMD_F_INEXACT);
+ }
+ else if (yinf && xinf)
+ { /* If abs(x) and abs(y) are both infinity
+ return +-pi/4 or +- 3pi/4 according to
+ signs. */
+ if (xneg)
+ {
+ if (yneg) return valf_with_flags((float)-three_piby4, AMD_F_INEXACT);
+ else return valf_with_flags((float)three_piby4, AMD_F_INEXACT);
+ }
+ else
+ {
+ if (yneg) return valf_with_flags((float)-piby4, AMD_F_INEXACT);
+ else return valf_with_flags((float)piby4, AMD_F_INEXACT);
+ }
+ }
+
+ /* General case: take absolute values of arguments */
+
+ u = x; v = y;
+ if (xneg) u = -x;
+ if (yneg) v = -y;
+
+ /* Swap u and v if necessary to obtain 0 < v < u. Compute v/u. */
+
+ swap_vu = (u < v);
+ if (swap_vu) { uu = u; u = v; v = uu; }
+ vbyu = v/u;
+
+ if (vbyu > 0.0625)
+ { /* General values of v/u. Use a look-up
+ table and series expansion. */
+
+ index = (int)(256*vbyu + 0.5);
+ r = (256*v-index*u)/(256*u+index*v);
+
+ /* Polynomial approximation to atan(vbyu) */
+
+ s = r*r;
+ q = atan_jby256[index-16] + r - r*s*0.33333333333224095522;
+ }
+ else if (vbyu < 1.e-4)
+ { /* v/u is small enough that atan(v/u) = v/u */
+ q = vbyu;
+ }
+ else /* vbyu <= 0.0625 */
+ {
+ /* Small values of v/u. Use a series expansion */
+
+ s = vbyu*vbyu;
+ q = vbyu -
+ vbyu*s*(0.33333333333333170500 -
+ s*(0.19999999999393223405 -
+ s*0.14285713561807169030));
+ }
+
+ /* Tidy-up according to which quadrant the arguments lie in */
+
+ if (swap_vu) {q = piby2 - q;}
+ if (xneg) {q = pi - q;}
+ if (yneg) q = - q;
+ return (float)q;
+}
+
+weak_alias (__atan2f, atan2f)
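
The single-precision version folds the table step c = index/256 directly into the residual: scaling numerator and denominator by 256 turns (v - c*u)/(u + c*v) into (256*v - index*u)/(256*u + index*v), and one unsplit table entry plus a single correction term is enough for float accuracy. A small sketch, not part of the patch, checking that the two residual forms agree at an arbitrary point:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double u = 2.0, v = 0.55;                 /* v/u > 0.0625 */
        int index = (int)(256.0 * (v / u) + 0.5);
        double c = index / 256.0;
        double r1 = (v - c * u) / (u + c * v);                          /* atan2  */
        double r2 = (256.0 * v - index * u) / (256.0 * u + index * v); /* atan2f */
        printf("%.17g %.17g\n", r1, r2);          /* agree to rounding */
        return 0;
    }
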
diff --git a/src/atanf.c b/src/atanf.c
new file mode 100644
index 0000000..567dd87
--- /dev/null
+++ b/src/atanf.c
@@ -0,0 +1,170 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_VALF_WITH_FLAGS
+#define USE_NAN_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VALF_WITH_FLAGS
+#undef USE_NAN_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (float)x;
+ exc.arg2 = (float)x;
+ exc.name = (char *)"atanf";
+ exc.type = DOMAIN;
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("atanf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(atanf)
+#endif
+
+float FN_PROTOTYPE(atanf)(float fx)
+{
+
+  /* Some constants. */
+
+ static double piby2 = 1.5707963267948966e+00; /* 0x3ff921fb54442d18 */
+
+ double c, v, s, q, z;
+ unsigned int xnan;
+
+ double x = fx;
+
+ /* Find properties of argument fx. */
+
+ unsigned long long ux, aux, xneg;
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ xneg = ux & SIGNBIT_DP64;
+
+ v = x;
+ if (xneg) v = -x;
+
+ /* Argument reduction to range [-7/16,7/16] */
+
+ if (aux < 0x3ec0000000000000) /* v < 2.0^(-19) */
+ {
+ /* x is a good approximation to atan(x) */
+ if (aux == 0x0000000000000000)
+ return fx;
+ else
+ return valf_with_flags(fx, AMD_F_INEXACT);
+ }
+ else if (aux < 0x3fdc000000000000) /* v < 7./16. */
+ {
+ x = v;
+ c = 0.0;
+ }
+ else if (aux < 0x3fe6000000000000) /* v < 11./16. */
+ {
+ x = (2.0*v-1.0)/(2.0+v);
+ /* c = arctan(0.5) */
+ c = 4.63647609000806093515e-01; /* 0x3fddac670561bb4f */
+ }
+ else if (aux < 0x3ff3000000000000) /* v < 19./16. */
+ {
+ x = (v-1.0)/(1.0+v);
+ /* c = arctan(1.) */
+ c = 7.85398163397448278999e-01; /* 0x3fe921fb54442d18 */
+ }
+ else if (aux < 0x4003800000000000) /* v < 39./16. */
+ {
+ x = (v-1.5)/(1.0+1.5*v);
+ /* c = arctan(1.5) */
+ c = 9.82793723247329054082e-01; /* 0x3fef730bd281f69b */
+ }
+ else
+ {
+
+ xnan = (aux > PINFBITPATT_DP64);
+
+ if (xnan)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ unsigned int uhx;
+ GET_BITS_SP32(fx, uhx);
+ return handle_errorf("atanf", uhx|0x00400000, _DOMAIN,
+ 0, EDOM, fx, 0.0F);
+#else
+ return x + x; /* Raise invalid if it's a signalling NaN */
+#endif
+ }
+ else if (aux > 0x4190000000000000)
+ { /* abs(x) > 2^26 => arctan(1/x) is
+ insignificant compared to piby2 */
+ if (xneg)
+ return valf_with_flags((float)-piby2, AMD_F_INEXACT);
+ else
+ return valf_with_flags((float)piby2, AMD_F_INEXACT);
+ }
+
+ x = -1.0/v;
+ /* c = arctan(infinity) */
+ c = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
+ }
+
+ /* Core approximation: Remez(2,2) on [-7/16,7/16] */
+
+ s = x*x;
+ q = x*s*
+ (0.296528598819239217902158651186e0 +
+ (0.192324546402108583211697690500e0 +
+ 0.470677934286149214138357545549e-2*s)*s)/
+ (0.889585796862432286486651434570e0 +
+ (0.111072499995399550138837673349e1 +
+ 0.299309699959659728404442796915e0*s)*s);
+
+ z = c - (q - x);
+
+ if (xneg) z = -z;
+ return (float)z;
+}
+
+weak_alias (__atanf, atanf)
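
Each reduction branch in atanf applies atan(v) = atan(k) + atan((v - k)/(1 + k*v)) for a fixed pivot k (0.5, 1.0, 1.5, or the reciprocal case for large v), so the Remez(2,2) rational approximation only ever sees arguments of magnitude below roughly 7/16. A minimal sketch of one branch, not part of the patch, using atan from <math.h> as the reference and an arbitrary test value:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double v = 0.9, k = 1.0;              /* the 11/16 <= v < 19/16 branch */
        double x = (v - k) / (1.0 + k * v);   /* reduced argument, |x| < 7/16  */
        double recombined = atan(k) + atan(x);
        printf("%.17g %.17g\n", recombined, atan(v));
        return 0;
    }
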
diff --git a/src/atanh.c b/src/atanh.c
new file mode 100644
index 0000000..5815ced
--- /dev/null
+++ b/src/atanh.c
@@ -0,0 +1,193 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NAN_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_INFINITY_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x, double retval)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"atanh";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = retval;
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("atanh: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "atanh"
+double FN_PROTOTYPE(atanh)(double x)
+{
+
+ unsigned long long ux, ax;
+ double r, absx, t, poly;
+
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & ~SIGNBIT_DP64;
+ PUT_BITS_DP64(ax, absx);
+
+ if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity; return a NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x,nan_with_flags(AMD_F_INVALID));
+#endif
+ }
+ }
+ else if (ax >= 0x3ff0000000000000)
+ {
+ if (ax > 0x3ff0000000000000)
+ {
+ /* abs(x) > 1.0; return NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x,nan_with_flags(AMD_F_INVALID));
+#endif
+ }
+ else if (ux == 0x3ff0000000000000)
+ {
+ /* x = +1.0; return infinity with the same sign as x
+ and set the divbyzero status flag */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, PINFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x,infinity_with_flags(AMD_F_DIVBYZERO));
+#endif
+ }
+ else
+ {
+ /* x = -1.0; return infinity with the same sign as x */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, NINFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x,-infinity_with_flags(AMD_F_DIVBYZERO));
+#endif
+ }
+ }
+
+
+ if (ax < 0x3e30000000000000)
+ {
+ if (ax == 0x0000000000000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ {
+ /* Arguments smaller than 2^(-28) in magnitude are
+ approximated by atanh(x) = x, raising inexact flag. */
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ }
+ else
+ {
+ if (ax < 0x3fe0000000000000)
+ {
+ /* Arguments up to 0.5 in magnitude are
+ approximated by a [5,5] minimax polynomial */
+ t = x*x;
+ poly =
+ (0.47482573589747356373e0 +
+ (-0.11028356797846341457e1 +
+ (0.88468142536501647470e0 +
+ (-0.28180210961780814148e0 +
+ (0.28728638600548514553e-1 -
+ 0.10468158892753136958e-3 * t) * t) * t) * t) * t) /
+ (0.14244772076924206909e1 +
+ (-0.41631933639693546274e1 +
+ (0.45414700626084508355e1 +
+ (-0.22608883748988489342e1 +
+ (0.49561196555503101989e0 -
+ 0.35861554370169537512e-1 * t) * t) * t) * t) * t);
+ return x + x*t*poly;
+ }
+ else
+ {
+ /* abs(x) >= 0.5 */
+ /* Note that
+ atanh(x) = 0.5 * ln((1+x)/(1-x))
+ (see Abramowitz and Stegun 4.6.22).
+ For greater accuracy we use the variant formula
+ atanh(x) = log(1 + 2x/(1-x)) = log1p(2x/(1-x)).
+ */
+ r = (2.0 * absx) / (1.0 - absx);
+ r = 0.5 * FN_PROTOTYPE(log1p)(r);
+ if (ux & SIGNBIT_DP64)
+ /* Argument x is negative */
+ return -r;
+ else
+ return r;
+ }
+ }
+}
+
+weak_alias (__atanh, atanh)
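
For |x| >= 0.5 the code evaluates atanh through the rewrite 0.5*ln((1+x)/(1-x)) = 0.5*log1p(2x/(1-x)) noted in the comment above. A small check of that identity, not part of the patch, assuming the C99 log1p and atanh from <math.h> and an arbitrary test point:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 0.75;
        double direct  = 0.5 * log((1.0 + x) / (1.0 - x));
        double variant = 0.5 * log1p(2.0 * x / (1.0 - x));   /* form used above */
        printf("%.17g %.17g %.17g\n", direct, variant, atanh(x));
        return 0;
    }
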
diff --git a/src/atanhf.c b/src/atanhf.c
new file mode 100644
index 0000000..38692b4
--- /dev/null
+++ b/src/atanhf.c
@@ -0,0 +1,194 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include <stdio.h>
+
+#define USE_NANF_WITH_FLAGS
+#define USE_VALF_WITH_FLAGS
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x, float retval)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"atanhf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = (double)retval;
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("atanhf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "atanhf"
+float FN_PROTOTYPE(atanhf)(float x)
+{
+
+ double dx;
+ unsigned int ux, ax;
+ double r, t, poly;
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & ~SIGNBIT_SP32;
+
+ if ((ux & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity; return a NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x,nanf_with_flags(AMD_F_INVALID));
+#endif
+ }
+ }
+ else if (ax >= 0x3f800000)
+ {
+ if (ax > 0x3f800000)
+ {
+ /* abs(x) > 1.0; return NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x,nanf_with_flags(AMD_F_INVALID));
+#endif
+ }
+ else if (ux == 0x3f800000)
+ {
+ /* x = +1.0; return infinity with the same sign as x
+ and set the divbyzero status flag */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, PINFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x,infinityf_with_flags(AMD_F_DIVBYZERO));
+#endif
+ }
+ else
+ {
+ /* x = -1.0; return infinity with the same sign as x */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, NINFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x,-infinityf_with_flags(AMD_F_DIVBYZERO));
+#endif
+ }
+ }
+
+ if (ax < 0x39000000)
+ {
+ if (ax == 0x00000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ {
+ /* Arguments smaller than 2^(-13) in magnitude are
+ approximated by atanhf(x) = x, raising inexact flag. */
+ return valf_with_flags(x, AMD_F_INEXACT);
+ }
+ }
+ else
+ {
+ dx = x;
+ if (ax < 0x3f000000)
+ {
+ /* Arguments up to 0.5 in magnitude are
+ approximated by a [2,2] minimax polynomial */
+ t = dx*dx;
+ poly =
+ (0.39453629046e0 +
+ (-0.28120347286e0 +
+ 0.92834212715e-2 * t) * t) /
+ (0.11836088638e1 +
+ (-0.15537744551e1 +
+ 0.45281890445e0 * t) * t);
+ return (float)(dx + dx*t*poly);
+ }
+ else
+ {
+ /* abs(x) >= 0.5 */
+ /* Note that
+ atanhf(x) = 0.5 * ln((1+x)/(1-x))
+ (see Abramowitz and Stegun 4.6.22).
+ For greater accuracy we use the variant formula
+ atanhf(x) = log(1 + 2x/(1-x)) = log1p(2x/(1-x)).
+ */
+ if (ux & SIGNBIT_SP32)
+ {
+ /* Argument x is negative */
+ r = (-2.0 * dx) / (1.0 + dx);
+ r = 0.5 * FN_PROTOTYPE(log1p)(r);
+ return (float)-r;
+ }
+ else
+ {
+ r = (2.0 * dx) / (1.0 - dx);
+ r = 0.5 * FN_PROTOTYPE(log1p)(r);
+ return (float)r;
+ }
+ }
+ }
+}
+
+weak_alias (__atanhf, atanhf)
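
Below 0.5 in magnitude atanhf evaluates the odd series in the form x + x*t*P(t)/Q(t) with t = x*x, i.e. a [2,2] rational correction applied on top of the leading term x. A standalone sketch, not part of the patch, that re-evaluates that rational form in double precision (coefficients copied from the function above) against atanh from <math.h>; the test point is arbitrary:

    #include <math.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 0.25, t = x * x;
        double poly = (0.39453629046e0 +
                      (-0.28120347286e0 + 0.92834212715e-2 * t) * t) /
                      (0.11836088638e1 +
                      (-0.15537744551e1 + 0.45281890445e0 * t) * t);
        printf("%.9g %.9g\n", x + x * t * poly, atanh(x));  /* agree to float accuracy */
        return 0;
    }
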
diff --git a/src/ceil.c b/src/ceil.c
new file mode 100644
index 0000000..94ef21d
--- /dev/null
+++ b/src/ceil.c
@@ -0,0 +1,104 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#endif
+
+#ifdef WINDOWS
+#pragma function(ceil)
+#endif
+
+double FN_PROTOTYPE(ceil)(double x)
+{
+ double r;
+ long long rexp, xneg;
+ unsigned long long ux, ax, ur, mask;
+
+ GET_BITS_DP64(x, ux);
+ /*ax is |x|*/
+ ax = ux & (~SIGNBIT_DP64);
+ /*xneg stores the sign of the input x*/
+ xneg = (ux != ax);
+  /* The range is divided into:
+     |x| >= 2^53:       the value is already integral (or NaN/infinity);
+                        a NaN input returns a QNaN, raising an exception
+                        if the input is a SNaN.
+     |x| < 1.0:         +/-0.0 is returned unchanged; if -1.0 < x < -0.0
+                        return -0.0; if 0.0 < x < 1.0 return 1.0.
+     1.0 <= |x| < 2^53: use the exponent to mask off the fractional
+                        mantissa bits and adjust the result.
+  */
+  if (ax >= 0x4340000000000000)  /* abs(x) >= 2^53 */
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^53 */
+ if (ax > 0x7ff0000000000000)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error("ceil", ux|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ return x;
+ }
+ else if (ax < 0x3ff0000000000000) /* abs(x) < 1.0 */
+ {
+ if (ax == 0x0000000000000000)
+ /* x is +zero or -zero; return the same zero */
+ return x;
+ else if (xneg) /* x < 0.0; return -0.0 */
+ {
+ PUT_BITS_DP64(0x8000000000000000, r);
+ return r;
+ }
+ else
+ return 1.0;
+ }
+ else
+ {
+      /* Get the exponent for the floating point number. Should be between 0 and 52. */
+ rexp = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ /* Mask out the bits of r that we don't want */
+ mask = 1;
+ mask = (mask << (EXPSHIFTBITS_DP64 - rexp)) - 1;
+ /*Keeps the exponent part and the required mantissa.*/
+ ur = (ux & ~mask);
+ PUT_BITS_DP64(ur, r);
+ if (xneg || (ur == ux))
+ return r;
+ else
+ /* We threw some bits away and x was positive */
+ return r + 1.0;
+ }
+
+}
+
+weak_alias (__ceil, ceil)
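
For 1.0 <= |x| < 2^53 the rounding is done entirely on the bit pattern: the unbiased exponent says how many low mantissa bits hold the fraction, clearing them truncates toward zero, and positive inputs that lost bits are bumped up by 1.0. A minimal sketch of the same trick, not part of the patch, using memcpy in place of the GET_BITS_DP64/PUT_BITS_DP64 macros and handling only a positive example:

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        double x = 5.3;
        uint64_t ux;
        memcpy(&ux, &x, sizeof ux);                    /* reinterpret the bits   */
        int rexp = (int)((ux >> 52) & 0x7ff) - 1023;   /* unbiased exponent      */
        uint64_t mask = (((uint64_t)1) << (52 - rexp)) - 1;
        uint64_t ur = ux & ~mask;                      /* truncate toward zero   */
        double r;
        memcpy(&r, &ur, sizeof r);
        if (ur != ux)                                  /* bits were discarded... */
            r += 1.0;                                  /* ...so round up         */
        printf("%g\n", r);                             /* prints 6               */
        return 0;
    }
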
diff --git a/src/ceilf.c b/src/ceilf.c
new file mode 100644
index 0000000..56d0c37
--- /dev/null
+++ b/src/ceilf.c
@@ -0,0 +1,97 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERRORF
+#endif
+
+#ifdef WINDOWS
+#pragma function(ceilf)
+#endif
+
+float FN_PROTOTYPE(ceilf)(float x)
+{
+ float r;
+ int rexp, xneg;
+ unsigned int ux, ax, ur, mask;
+
+ GET_BITS_SP32(x, ux);
+ /*ax is |x|*/
+ ax = ux & (~SIGNBIT_SP32);
+ /*xneg stores the sign of the input x*/
+ xneg = (ux != ax);
+  /* The range is divided into:
+     |x| >= 2^24:       the value is already integral (or NaN/infinity);
+                        a NaN input returns a QNaN, raising an exception
+                        if the input is a SNaN.
+     |x| < 1.0:         +/-0.0 is returned unchanged; if -1.0 < x < -0.0
+                        return -0.0; if 0.0 < x < 1.0 return 1.0.
+     1.0 <= |x| < 2^24: use the exponent to mask off the fractional
+                        mantissa bits and adjust the result.
+  */
+  if (ax >= 0x4b800000)  /* abs(x) >= 2^24 */
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^24 */
+ if (ax > 0x7f800000)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf("ceilf", ux, _DOMAIN, 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ return x;
+ }
+ else if (ax < 0x3f800000) /* abs(x) < 1.0 */
+ {
+ if (ax == 0x00000000)
+ /* x is +zero or -zero; return the same zero */
+ return x;
+ else if (xneg) /* x < 0.0 */
+ return -0.0F;
+ else
+ return 1.0F;
+ }
+ else
+ {
+ rexp = ((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+ /* Mask out the bits of r that we don't want */
+ mask = (1 << (EXPSHIFTBITS_SP32 - rexp)) - 1;
+ /*Keeps the exponent part and the required mantissa.*/
+ ur = (ux & ~mask);
+ PUT_BITS_SP32(ur, r);
+
+ if (xneg || (ux == ur)) return r;
+ else
+ /* We threw some bits away and x was positive */
+ return r + 1.0F;
+ }
+}
+
+weak_alias (__ceilf, ceilf)
diff --git a/src/cosh.c b/src/cosh.c
new file mode 100644
index 0000000..6f8734b
--- /dev/null
+++ b/src/cosh.c
@@ -0,0 +1,359 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_INFINITY_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline double retval_errno_erange(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"cosh";
+ if (_LIB_VERSION == _SVID_)
+ {
+ exc.retval = HUGE;
+ }
+ else
+ {
+ exc.retval = infinity_with_flags(AMD_F_OVERFLOW);
+ }
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+double FN_PROTOTYPE(cosh)(double x)
+{
+ /*
+ Derived from sinh subroutine
+
+ After dealing with special cases the computation is split into
+ regions as follows:
+
+    abs(x) >= max_cosh_arg:
+      cosh(x) = +Inf, with the overflow flag raised
+
+    abs(x) >= small_threshold:
+      cosh(x) = exp(abs(x))/2 computed using the
+      splitexp and scaleDouble functions as for exp_amd().
+
+    abs(x) < small_threshold:
+      let y = abs(x), y0 = (int)y and dy = y - y0; then
+      cosh(x) = cosh(y0)*cosh(dy) + sinh(y0)*sinh(dy),
+      using the tabulated values of sinh(y0) and cosh(y0) below. */
+
+ static const double
+ max_cosh_arg = 7.10475860073943977113e+02, /* 0x408633ce8fb9f87e */
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+// small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+ small_threshold = 20.0;
+      /* (8*BASEDIGITS_DP64*log10of2): exp(-x) insignificant compared to exp(x) */
+
+ /* Lead and tail tabulated values of sinh(i) and cosh(i)
+ for i = 0,...,36. The lead part has 26 leading bits. */
+
+ static const double sinh_lead[ 37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.17520117759704589844e+00, /* 0x3ff2cd9fc0000000 */
+ 3.62686038017272949219e+00, /* 0x400d03cf60000000 */
+ 1.00178747177124023438e+01, /* 0x40240926e0000000 */
+ 2.72899169921875000000e+01, /* 0x403b4a3800000000 */
+ 7.42032089233398437500e+01, /* 0x40528d0160000000 */
+ 2.01713153839111328125e+02, /* 0x406936d228000000 */
+ 5.48316116333007812500e+02, /* 0x4081228768000000 */
+ 1.49047882080078125000e+03, /* 0x409749ea50000000 */
+ 4.05154187011718750000e+03, /* 0x40afa71570000000 */
+ 1.10132326660156250000e+04, /* 0x40c5829dc8000000 */
+ 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */
+ 8.13773945312500000000e+04, /* 0x40f3de1650000000 */
+ 2.21206695312500000000e+05, /* 0x410b00b590000000 */
+ 6.01302140625000000000e+05, /* 0x412259ac48000000 */
+ 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */
+ 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */
+ 1.20774762500000000000e+07, /* 0x4167093488000000 */
+ 3.28299845000000000000e+07, /* 0x417f4f2208000000 */
+ 8.92411500000000000000e+07, /* 0x419546d8f8000000 */
+ 2.42582596000000000000e+08, /* 0x41aceb0888000000 */
+ 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */
+ 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */
+ 4.87240166400000000000e+09, /* 0x41f226af30000000 */
+ 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */
+ 3.60024494080000000000e+10, /* 0x4220c3d390000000 */
+ 9.78648043520000000000e+10, /* 0x4236c93268000000 */
+ 2.66024116224000000000e+11, /* 0x424ef822f0000000 */
+ 7.23128516608000000000e+11, /* 0x42650bba30000000 */
+ 1.96566712320000000000e+12, /* 0x427c9aae40000000 */
+ 5.34323724288000000000e+12, /* 0x4293704708000000 */
+ 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */
+ 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */
+ 1.07321789251584000000e+14, /* 0x42d866f348000000 */
+ 2.91730863685632000000e+14, /* 0x42f0953e28000000 */
+ 7.93006722514944000000e+14, /* 0x430689e220000000 */
+ 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */
+
+ static const double sinh_tail[ 37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.60467555584448807892e-08, /* 0x3e513ae6096a0092 */
+ 2.76742892754807136947e-08, /* 0x3e5db70cfb79a640 */
+ 2.09697499555224576530e-07, /* 0x3e8c2526b66dc067 */
+ 2.04940252448908240062e-07, /* 0x3e8b81b18647f380 */
+ 1.65444891522700935932e-06, /* 0x3ebbc1cdd1e1eb08 */
+ 3.53116789999998198721e-06, /* 0x3ecd9f201534fb09 */
+ 6.94023870987375490695e-06, /* 0x3edd1c064a4e9954 */
+ 4.98876893611587449271e-06, /* 0x3ed4eca65d06ea74 */
+ 3.19656024605152215752e-05, /* 0x3f00c259bcc0ecc5 */
+ 2.08687768377236501204e-04, /* 0x3f2b5a6647cf9016 */
+ 4.84668088325403796299e-05, /* 0x3f09691adefb0870 */
+ 1.17517985422733832468e-03, /* 0x3f53410fc29cde38 */
+ 6.90830086959560562415e-04, /* 0x3f46a31a50b6fb3c */
+ 1.45697262451506548420e-03, /* 0x3f57defc71805c40 */
+ 2.99859023684906737806e-02, /* 0x3f9eb49fd80e0bab */
+ 1.02538800507941396667e-02, /* 0x3f84fffc7bcd5920 */
+ 1.26787628407699110022e-01, /* 0x3fc03a93b6c63435 */
+ 6.86652479544033744752e-02, /* 0x3fb1940bb255fd1c */
+ 4.81593627621056619148e-01, /* 0x3fded26e14260b50 */
+ 1.70489513795397629181e+00, /* 0x3ffb47401fc9f2a2 */
+ 1.12416073482258713767e+01, /* 0x40267bb3f55634f1 */
+ 7.06579578070110514432e+00, /* 0x401c435ff8194ddc */
+ 5.91244512999659974639e+01, /* 0x404d8fee052ba63a */
+ 1.68921736147050694399e+02, /* 0x40651d7edccde3f6 */
+ 2.60692936262073658327e+02, /* 0x40704b1644557d1a */
+ 3.62419382134885609048e+02, /* 0x4076a6b5ca0a9dc4 */
+ 4.07689930834187271103e+03, /* 0x40afd9cc72249aba */
+ 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */
+ 2.53720210371943067003e+04, /* 0x40d8c70158ac6363 */
+ 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */
+ 1.81871712615542812273e+05, /* 0x4106337db36fc718 */
+ 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */
+ 6.41374032312148716301e+05, /* 0x412392bc108b37cc */
+ 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */
+ 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */
+ 7.63580561355670914054e+06}; /* 0x415d20d76744835c */
+
+ static const double cosh_lead[ 37] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.54308062791824340820e+00, /* 0x3ff8b07550000000 */
+ 3.76219564676284790039e+00, /* 0x400e18fa08000000 */
+ 1.00676617622375488281e+01, /* 0x402422a490000000 */
+ 2.73082327842712402344e+01, /* 0x403b4ee858000000 */
+ 7.42099475860595703125e+01, /* 0x40528d6fc8000000 */
+ 2.01715633392333984375e+02, /* 0x406936e678000000 */
+ 5.48317031860351562500e+02, /* 0x4081228948000000 */
+ 1.49047915649414062500e+03, /* 0x409749eaa8000000 */
+ 4.05154199218750000000e+03, /* 0x40afa71580000000 */
+ 1.10132329101562500000e+04, /* 0x40c5829dd0000000 */
+ 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */
+ 8.13773945312500000000e+04, /* 0x40f3de1650000000 */
+ 2.21206695312500000000e+05, /* 0x410b00b590000000 */
+ 6.01302140625000000000e+05, /* 0x412259ac48000000 */
+ 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */
+ 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */
+ 1.20774762500000000000e+07, /* 0x4167093488000000 */
+ 3.28299845000000000000e+07, /* 0x417f4f2208000000 */
+ 8.92411500000000000000e+07, /* 0x419546d8f8000000 */
+ 2.42582596000000000000e+08, /* 0x41aceb0888000000 */
+ 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */
+ 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */
+ 4.87240166400000000000e+09, /* 0x41f226af30000000 */
+ 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */
+ 3.60024494080000000000e+10, /* 0x4220c3d390000000 */
+ 9.78648043520000000000e+10, /* 0x4236c93268000000 */
+ 2.66024116224000000000e+11, /* 0x424ef822f0000000 */
+ 7.23128516608000000000e+11, /* 0x42650bba30000000 */
+ 1.96566712320000000000e+12, /* 0x427c9aae40000000 */
+ 5.34323724288000000000e+12, /* 0x4293704708000000 */
+ 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */
+ 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */
+ 1.07321789251584000000e+14, /* 0x42d866f348000000 */
+ 2.91730863685632000000e+14, /* 0x42f0953e28000000 */
+ 7.93006722514944000000e+14, /* 0x430689e220000000 */
+ 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */
+
+ static const double cosh_tail[ 37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 6.89700037027478056904e-09, /* 0x3e3d9f5504c2bd28 */
+ 4.43207835591715833630e-08, /* 0x3e67cb66f0a4c9fd */
+ 2.33540217013828929694e-07, /* 0x3e8f58617928e588 */
+ 5.17452463948269748331e-08, /* 0x3e6bc7d000c38d48 */
+ 9.38728274131605919153e-07, /* 0x3eaf7f9d4e329998 */
+ 2.73012191010840495544e-06, /* 0x3ec6e6e464885269 */
+ 3.29486051438996307950e-06, /* 0x3ecba3a8b946c154 */
+ 4.75803746362771416375e-06, /* 0x3ed3f4e76110d5a4 */
+ 3.33050940471947692369e-05, /* 0x3f017622515a3e2b */
+ 9.94707313972136215365e-06, /* 0x3ee4dc4b528af3d0 */
+ 6.51685096227860253398e-05, /* 0x3f11156278615e10 */
+ 1.18132406658066663359e-03, /* 0x3f535ad50ed821f5 */
+ 6.93090416366541877541e-04, /* 0x3f46b61055f2935c */
+ 1.45780415323416845386e-03, /* 0x3f57e2794a601240 */
+ 2.99862082708111758744e-02, /* 0x3f9eb4b45f6aadd3 */
+ 1.02539925859688602072e-02, /* 0x3f85000b967b3698 */
+ 1.26787669807076286421e-01, /* 0x3fc03a940fadc092 */
+ 6.86652631843830962843e-02, /* 0x3fb1940bf3bf874c */
+ 4.81593633223853068159e-01, /* 0x3fded26e1a2a2110 */
+ 1.70489514001513020602e+00, /* 0x3ffb4740205796d6 */
+ 1.12416073489841270572e+01, /* 0x40267bb3f55cb85d */
+ 7.06579578098005001152e+00, /* 0x401c435ff81e18ac */
+ 5.91244513000686140458e+01, /* 0x404d8fee052bdea4 */
+ 1.68921736147088438429e+02, /* 0x40651d7edccde926 */
+ 2.60692936262087528121e+02, /* 0x40704b1644557e0e */
+ 3.62419382134890611269e+02, /* 0x4076a6b5ca0a9e1c */
+ 4.07689930834187453002e+03, /* 0x40afd9cc72249abe */
+ 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */
+ 2.53720210371943103382e+04, /* 0x40d8c70158ac6364 */
+ 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */
+ 1.81871712615542812273e+05, /* 0x4106337db36fc718 */
+ 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */
+ 6.41374032312148716301e+05, /* 0x412392bc108b37cc */
+ 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */
+ 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */
+ 7.63580561355670914054e+06}; /* 0x415d20d76744835c */
+
+ unsigned long long ux, aux, xneg;
+ double y, z, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3e30000000000000) /* |x| small enough that cosh(x) = 1 */
+ {
+ if (aux == 0)
+ /* with no inexact */
+ return 1.0;
+ else
+ return val_with_flags(1.0, AMD_F_INEXACT);
+ }
+ else if (aux >= PINFBITPATT_DP64) /* |x| is NaN or Inf */
+ {
+ if (aux > PINFBITPATT_DP64) /* |x| is a NaN? */
+ return x + x;
+ else /* x is infinity */
+ return infinity_with_flags(0);
+ }
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y >= max_cosh_arg)
+ {
+ /* Return +/-infinity with overflow flag */
+#ifdef WINDOWS
+ return handle_error("cosh", PINFBITPATT_DP64, _OVERFLOW,
+ AMD_F_OVERFLOW, EDOM, x, 0.0F);
+#else
+ return retval_errno_erange(x);
+#endif
+
+
+ }
+ else if (y >= small_threshold)
+ {
+ /* In this range y is large enough so that
+ the negative exponential is negligible,
+ so cosh(y) is approximated by exp(y)/2 (cosh is even). The
+ code below is an inlined version of that from
+ exp() with two changes (it operates on
+ y instead of x, and the division by 2 is
+ done by reducing m by 1). */
+
+ splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ m -= 1;
+
+ if (m >= EMIN_DP64 && m <= EMAX_DP64)
+ z = scaleDouble_1((z1+z2),m);
+ else
+ z = scaleDouble_2((z1+z2),m);
+ }
+ else
+ {
+ /* In this range we find the integer part y0 of y
+ and the increment dy = y - y0. We then compute
+
+ z = cosh(y) = cosh(y0)cosh(dy) + sinh(y0)sinh(dy)
+
+ where sinh(y0) and cosh(y0) are tabulated above. */
+
+ int ind;
+ double dy, dy2, sdy, cdy;
+
+ ind = (int)y;
+ dy = y - ind;
+
+ dy2 = dy*dy;
+ sdy = dy*dy2*(0.166666666666666667013899e0 +
+ (0.833333333333329931873097e-2 +
+ (0.198412698413242405162014e-3 +
+ (0.275573191913636406057211e-5 +
+ (0.250521176994133472333666e-7 +
+ (0.160576793121939886190847e-9 +
+ 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ cdy = dy2*(0.500000000000000005911074e0 +
+ (0.416666666666660876512776e-1 +
+ (0.138888888889814854814536e-2 +
+ (0.248015872460622433115785e-4 +
+ (0.275573350756016588011357e-6 +
+ (0.208744349831471353536305e-8 +
+ 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ /* At this point sinh(dy) is approximated by dy + sdy, and cosh(dy) is approximated by 1 + cdy.
+ Accumulate the terms below from smallest to largest to limit rounding error. */
+ z = ((((((cosh_tail[ind]*cdy + sinh_tail[ind]*sdy)
+ + sinh_tail[ind]*dy) + cosh_tail[ind])
+ + cosh_lead[ind]*cdy) + sinh_lead[ind]*sdy)
+ + sinh_lead[ind]*dy) + cosh_lead[ind];
+ }
+
+ return z;
+}
+
+weak_alias (__cosh, cosh)
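+
+/* Editorial illustration (not part of the original sources): the table-driven
+ branch above relies on the addition formula
+ cosh(y0 + dy) = cosh(y0)*cosh(dy) + sinh(y0)*sinh(dy),
+ with y0 = (int)y and 0 <= dy < 1. A plain-libm sketch of that decomposition;
+ cosh_by_tables_sketch is a hypothetical helper name:
+
+ #include <math.h>
+ static double cosh_by_tables_sketch(double y) // assumes 0 <= y < 37
+ {
+ int y0 = (int)y; // index into the 37-entry tables
+ double dy = y - y0; // 0 <= dy < 1
+ return cosh(y0) * cosh(dy) + sinh(y0) * sinh(dy);
+ }
+
+ In the real code cosh(y0)/sinh(y0) come from the lead/tail tables and
+ cosh(dy)/sinh(dy) from the short polynomials, summed smallest-first. */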
diff --git a/src/coshf.c b/src/coshf.c
new file mode 100644
index 0000000..ab2b68e
--- /dev/null
+++ b/src/coshf.c
@@ -0,0 +1,268 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_VALF_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline float retval_errno_erange(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"coshf";
+ if (_LIB_VERSION == _SVID_)
+ {
+ exc.retval = HUGE;
+ }
+ else
+ {
+ exc.retval = infinityf_with_flags(AMD_F_OVERFLOW);
+ }
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+
+#endif
+float FN_PROTOTYPE(coshf)(float fx)
+{
+ /*
+ After dealing with special cases the computation is split into
+ regions as follows:
+
+ abs(x) >= max_cosh_arg:
+ cosh(x) = +Inf
+
+ abs(x) >= small_threshold:
+ cosh(x) = exp(abs(x))/2 computed using the
+ splitexp and scaleDouble functions as for exp_amd().
+
+ abs(x) < small_threshold:
+ let y0 = (int)abs(x) and dy = abs(x) - y0; then
+ cosh(x) = cosh(y0)*cosh(dy) + sinh(y0)*sinh(dy),
+ with cosh(y0) and sinh(y0) taken from the tables below and
+ cosh(dy), sinh(dy) from short polynomials. */
+
+ static const double
+ /* The max argument of coshf, but stored as a double */
+ max_cosh_arg = 8.94159862922329438106e+01, /* 0x40565a9f84f82e63 */
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+
+ small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+// small_threshold = 20.0;
+ /* (8*BASEDIGITS_DP64*log10of2): above this threshold exp(-x) is insignificant compared with exp(x) */
+
+ /* Tabulated values of sinh(i) and cosh(i) for i = 0,...,36. */
+
+ static const double sinh_lead[ 37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.17520119364380137839e+00, /* 0x3ff2cd9fc44eb982 */
+ 3.62686040784701857476e+00, /* 0x400d03cf63b6e19f */
+ 1.00178749274099008204e+01, /* 0x40240926e70949ad */
+ 2.72899171971277496596e+01, /* 0x403b4a3803703630 */
+ 7.42032105777887522891e+01, /* 0x40528d0166f07374 */
+ 2.01713157370279219549e+02, /* 0x406936d22f67c805 */
+ 5.48316123273246489589e+02, /* 0x408122876ba380c9 */
+ 1.49047882578955000099e+03, /* 0x409749ea514eca65 */
+ 4.05154190208278987484e+03, /* 0x40afa7157430966f */
+ 1.10132328747033916443e+04, /* 0x40c5829dced69991 */
+ 2.99370708492480553105e+04, /* 0x40dd3c4488cb48d6 */
+ 8.13773957064298447222e+04, /* 0x40f3de1654d043f0 */
+ 2.21206696003330085659e+05, /* 0x410b00b5916a31a5 */
+ 6.01302142081972560845e+05, /* 0x412259ac48bef7e3 */
+ 1.63450868623590236530e+06, /* 0x4138f0ccafad27f6 */
+ 4.44305526025387924165e+06, /* 0x4150f2ebd0a7ffe3 */
+ 1.20774763767876271158e+07, /* 0x416709348c0ea4ed */
+ 3.28299845686652474105e+07, /* 0x417f4f22091940bb */
+ 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */
+ 2.42582597704895108938e+08, /* 0x41aceb088b68e803 */
+ 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */
+ 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */
+ 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */
+ 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */
+ 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */
+ 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */
+ 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */
+ 7.23128532145737548828e+11, /* 0x42650bba3796379a */
+ 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */
+ 5.34323729076223046875e+12, /* 0x429370470aec28ec */
+ 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */
+ 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */
+ 1.07321789892958031250e+14, /* 0x42d866f34a725782 */
+ 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */
+ 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */
+ 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */
+
+ static const double cosh_lead[ 37] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.54308063481524371241e+00, /* 0x3ff8b07551d9f550 */
+ 3.76219569108363138810e+00, /* 0x400e18fa0df2d9bc */
+ 1.00676619957777653269e+01, /* 0x402422a497d6185e */
+ 2.73082328360164865444e+01, /* 0x403b4ee858de3e80 */
+ 7.42099485247878334349e+01, /* 0x40528d6fcbeff3a9 */
+ 2.01715636122455890700e+02, /* 0x406936e67db9b919 */
+ 5.48317035155212010977e+02, /* 0x4081228949ba3a8b */
+ 1.49047916125217807348e+03, /* 0x409749eaa93f4e76 */
+ 4.05154202549259389343e+03, /* 0x40afa715845d8894 */
+ 1.10132329201033226127e+04, /* 0x40c5829dd053712d */
+ 2.99370708659497577173e+04, /* 0x40dd3c4489115627 */
+ 8.13773957125740562333e+04, /* 0x40f3de1654d6b543 */
+ 2.21206696005590405548e+05, /* 0x410b00b5916b6105 */
+ 6.01302142082804115489e+05, /* 0x412259ac48bf13ca */
+ 1.63450868623620807193e+06, /* 0x4138f0ccafad2d17 */
+ 4.44305526025399193168e+06, /* 0x4150f2ebd0a8005c */
+ 1.20774763767876680940e+07, /* 0x416709348c0ea503 */
+ 3.28299845686652623117e+07, /* 0x417f4f22091940bf */
+ 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */
+ 2.42582597704895138741e+08, /* 0x41aceb088b68e804 */
+ 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */
+ 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */
+ 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */
+ 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */
+ 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */
+ 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */
+ 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */
+ 7.23128532145737548828e+11, /* 0x42650bba3796379a */
+ 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */
+ 5.34323729076223046875e+12, /* 0x429370470aec28ec */
+ 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */
+ 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */
+ 1.07321789892958031250e+14, /* 0x42d866f34a725782 */
+ 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */
+ 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */
+ 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */
+
+ unsigned long long ux, aux, xneg;
+ double x = fx, y, z, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3f10000000000000) /* |x| small enough that cosh(x) = 1 */
+ {
+ if (aux == 0) return (float)1.0; /* with no inexact */
+ if (LAMBDA_DP64 + x > 1.0) return valf_with_flags((float)1.0, AMD_F_INEXACT); /* with inexact */
+ }
+ else if (aux >= PINFBITPATT_DP64) /* |x| is NaN or Inf */
+ {
+ if (aux > PINFBITPATT_DP64) /* |x| is a NaN? */
+ return fx + fx;
+ else /* x is infinity */
+ return infinityf_with_flags(0);
+ }
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y >= max_cosh_arg)
+ {
+ /* Return infinity with overflow flag. */
+ /* This handles POSIX behaviour */
+ __set_errno(ERANGE);
+ z = infinityf_with_flags(AMD_F_OVERFLOW);
+ }
+ else if (y >= small_threshold)
+ {
+ /* In this range y is large enough so that
+ the negative exponential is negligible,
+ so cosh(y) is approximated by exp(y)/2 (cosh is even). The
+ code below is an inlined version of that from
+ exp() with two changes (it operates on
+ y instead of x, and the division by 2 is
+ done by reducing m by 1). */
+
+ splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ m -= 1;
+
+ /* scaleDouble_1 is always safe because the argument x was
+ float, rather than double */
+
+ z = scaleDouble_1((z1+z2),m);
+ }
+ else
+ {
+ /* In this range we find the integer part y0 of y
+ and the increment dy = y - y0. We then compute
+
+ z = cosh(y) = cosh(y0)cosh(dy) + sinh(y0)sinh(dy)
+
+ where sinh(y0) and cosh(y0) are tabulated above. */
+
+ int ind;
+ double dy, dy2, sdy, cdy;
+
+ ind = (int)y;
+ dy = y - ind;
+
+ dy2 = dy*dy;
+
+ sdy = dy + dy*dy2*(0.166666666666666667013899e0 +
+ (0.833333333333329931873097e-2 +
+ (0.198412698413242405162014e-3 +
+ (0.275573191913636406057211e-5 +
+ (0.250521176994133472333666e-7 +
+ (0.160576793121939886190847e-9 +
+ 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ cdy = 1 + dy2*(0.500000000000000005911074e0 +
+ (0.416666666666660876512776e-1 +
+ (0.138888888889814854814536e-2 +
+ (0.248015872460622433115785e-4 +
+ (0.275573350756016588011357e-6 +
+ (0.208744349831471353536305e-8 +
+ 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ z = cosh_lead[ind]*cdy + sinh_lead[ind]*sdy;
+ }
+
+// if (xneg) z = - z;
+ return (float)z;
+}
+
+weak_alias (__coshf, coshf)
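+
+/* Editorial illustration (not from the original sources): for y >= small_threshold
+ the negative exponential is negligible and the branch above effectively computes
+ exp(y)/2, folding the halving into the exponent scaling (m -= 1). A plain-libm
+ sketch with a hypothetical helper name:
+
+ #include <math.h>
+ static float coshf_large_sketch(float x) // assumes |x| >= small_threshold
+ {
+ double y = fabs((double)x);
+ return (float)ldexp(exp(y), -1); // exp(y)/2, i.e. the exponent reduced by 1
+ }
+*/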
diff --git a/src/exp_special.c b/src/exp_special.c
new file mode 100644
index 0000000..ca32ec2
--- /dev/null
+++ b/src/exp_special.c
@@ -0,0 +1,110 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+#ifdef __x86_64__
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+// y = expf(x)
+// y = exp(x)
+
+// these codes and the ones in the related .S or .asm files have to match
+#define EXP_X_NAN 1
+#define EXP_Y_ZERO 2
+#define EXP_Y_INF 3
+
+float _expf_special(float x, float y, U32 code)
+{
+ switch(code)
+ {
+ case EXP_X_NAN:
+ {
+#ifdef WIN64
+ // y is assumed to be qnan, only check x for snan
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "expf", x, is_x_snan, 0.0f, 0, y, 0);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+
+ case EXP_Y_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW));
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "expf", x, 0, 0.0f, 0, y, 0);
+ }
+ break;
+
+ case EXP_Y_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW));
+ __amd_handle_errorf(OVERFLOW, ERANGE, "expf", x, 0, 0.0f, 0, y, 0);
+ }
+ break;
+ }
+
+
+ return y;
+}
+
+double _exp_special(double x, double y, U32 code)
+{
+ switch(code)
+ {
+ case EXP_X_NAN:
+ {
+#ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, "exp", x, 0.0, y);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+
+ case EXP_Y_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW));
+ __amd_handle_error(UNDERFLOW, ERANGE, "exp", x, 0.0, y);
+ }
+ break;
+
+ case EXP_Y_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW));
+ __amd_handle_error(OVERFLOW, ERANGE, "exp", x, 0.0, y);
+ }
+ break;
+ }
+
+
+ return y;
+}
+
+#endif /* __x86_64__ */
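+
+/* Editorial illustration (not from the original sources): the fast exp paths in
+ the related .S files hand the already-computed result y plus one of the codes
+ above to these handlers. A C-level sketch of such a caller; EXP_MAX_ARG and
+ EXP_MIN_ARG are hypothetical threshold names, the real checks live in assembly:
+
+ if (x != x) return _exp_special(x, x + x, EXP_X_NAN); // NaN input
+ if (x > EXP_MAX_ARG) return _exp_special(x, HUGE_VAL, EXP_Y_INF); // overflow
+ if (x < EXP_MIN_ARG) return _exp_special(x, 0.0, EXP_Y_ZERO); // underflow
+*/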
diff --git a/src/finite.c b/src/finite.c
new file mode 100644
index 0000000..7e7ca39
--- /dev/null
+++ b/src/finite.c
@@ -0,0 +1,60 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+/* Returns 0 if x is infinite or NaN, otherwise returns 1 */
+
+#ifdef WINDOWS
+int FN_PROTOTYPE(finite)(double x)
+#else
+int FN_PROTOTYPE(finite)(double x)
+#endif
+{
+
+#ifdef WINDOWS
+
+ unsigned long long ux;
+ GET_BITS_DP64(x, ux);
+ return (int)(((ux & ~SIGNBIT_DP64) - PINFBITPATT_DP64) >> 63);
+
+#else
+
+ /* This works on Hammer with gcc */
+ unsigned long ux =0x7ff0000000000000 ;
+ double temp;
+ PUT_BITS_DP64(ux, temp);
+
+ // double temp = 1.0e444; /* = infinity = 0x7ff0000000000000 */
+ volatile int retval;
+ retval = 0;
+ asm volatile ("andpd %0, %1;" : : "x" (temp), "x" (x));
+ asm volatile ("comisd %0, %1" : : "x" (temp), "x" (x));
+ asm volatile ("setnz %0" : "=g" (retval));
+ return retval;
+
+#endif
+}
+
+weak_alias (__finite, finite)
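+
+/* Editorial illustration (not from the original sources): the WINDOWS branch
+ relies on unsigned wrap-around: for finite x the absolute bit pattern is below
+ PINFBITPATT_DP64, so the subtraction wraps and sets bit 63; for Inf/NaN it does
+ not. A portable C sketch of the same test, with a hypothetical helper name:
+
+ #include <stdint.h>
+ #include <string.h>
+ static int finite_bits_sketch(double x)
+ {
+ uint64_t ux;
+ memcpy(&ux, &x, sizeof ux); // raw bit pattern of x
+ return (int)(((ux & ~(1ULL << 63)) - 0x7ff0000000000000ULL) >> 63);
+ }
+*/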
diff --git a/src/finitef.c b/src/finitef.c
new file mode 100644
index 0000000..8c0613a
--- /dev/null
+++ b/src/finitef.c
@@ -0,0 +1,60 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+/* Returns 0 if x is infinite or NaN, otherwise returns 1 */
+
+#ifdef WINDOWS
+int FN_PROTOTYPE(finitef)(float x)
+#else
+int FN_PROTOTYPE(finitef)(float x)
+#endif
+{
+
+#ifdef WINDOWS
+
+ unsigned int ux;
+ GET_BITS_SP32(x, ux);
+ return (int)(((ux & ~SIGNBIT_SP32) - PINFBITPATT_SP32) >> 31);
+
+#else
+
+ /* This works on Hammer */
+ unsigned int ux=0x7f800000;
+ float temp;
+ PUT_BITS_SP32(ux, temp);
+
+ /* float temp = 1.0e444; *//* = infinity = 0x7f800000 */
+ volatile int retval;
+ retval = 0;
+ asm volatile ("andps %0, %1;" : : "x" (temp), "x" (x));
+ asm volatile ("comiss %0, %1" : : "x" (temp), "x" (x));
+ asm volatile ("setnz %0" : "=g" (retval));
+ return retval;
+
+#endif
+}
+
+weak_alias (__finitef, finitef)
diff --git a/src/floor.c b/src/floor.c
new file mode 100644
index 0000000..a1b99c5
--- /dev/null
+++ b/src/floor.c
@@ -0,0 +1,92 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#endif
+
+#ifdef WINDOWS
+#pragma function(floor)
+#endif
+
+double FN_PROTOTYPE(floor)(double x)
+{
+ double r;
+ long long rexp, xneg;
+
+
+ unsigned long long ux, ax, ur, mask;
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & (~SIGNBIT_DP64);
+ xneg = (ux != ax);
+
+ if (ax >= 0x4340000000000000)
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^53 */
+ if (ax > 0x7ff0000000000000)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error("floor", ux|0x0008000000000000, _DOMAIN,
+ 0, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ return x;
+ }
+ else if (ax < 0x3ff0000000000000) /* abs(x) < 1.0 */
+ {
+ if (ax == 0x0000000000000000)
+ /* x is +zero or -zero; return the same zero */
+ return x;
+ else if (xneg) /* x < 0.0 */
+ return -1.0;
+ else
+ return 0.0;
+ }
+ else
+ {
+ r = x;
+ rexp = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ /* Mask out the bits of r that we don't want */
+ mask = 1;
+ mask = (mask << (EXPSHIFTBITS_DP64 - rexp)) - 1;
+ ur = (ux & ~mask);
+ PUT_BITS_DP64(ur, r);
+ if (xneg && (ur != ux))
+ /* We threw some bits away and x was negative */
+ return r - 1.0;
+ else
+ return r;
+ }
+
+}
+
+weak_alias (__floor, floor)
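+
+/* Editorial illustration (not from the original sources): in the final branch
+ above, the unbiased exponent rexp says how many mantissa bits lie below the
+ binary point (EXPSHIFTBITS_DP64 - rexp); clearing them truncates toward zero,
+ and a negative non-integer then needs a further -1. A portable C sketch of the
+ masking step, with a hypothetical helper name:
+
+ #include <stdint.h>
+ #include <string.h>
+ static double floor_mask_sketch(double x) // assumes 1.0 <= |x| < 2^52
+ {
+ uint64_t ux, ur;
+ double r;
+ memcpy(&ux, &x, sizeof ux);
+ int rexp = (int)((ux >> 52) & 0x7ff) - 1023; // unbiased exponent
+ ur = ux & ~((1ULL << (52 - rexp)) - 1); // clear bits below the point
+ memcpy(&r, &ur, sizeof r);
+ return (x < 0.0 && ur != ux) ? r - 1.0 : r; // round down, not toward zero
+ }
+*/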
diff --git a/src/floorf.c b/src/floorf.c
new file mode 100644
index 0000000..e0f855b
--- /dev/null
+++ b/src/floorf.c
@@ -0,0 +1,87 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERRORF
+#endif
+
+#ifdef WINDOWS
+#pragma function(floorf)
+#endif
+
+float FN_PROTOTYPE(floorf)(float x)
+{
+ float r;
+ int rexp, xneg;
+ unsigned int ux, ax, ur, mask;
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & (~SIGNBIT_SP32);
+ xneg = (ux != ax);
+
+ if (ax >= 0x4b800000)
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^24 */
+ if (ax > 0x7f800000)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf("floorf", ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ return x;
+ }
+ else if (ax < 0x3f800000) /* abs(x) < 1.0 */
+ {
+ if (ax == 0x00000000)
+ /* x is +zero or -zero; return the same zero */
+ return x;
+ else if (xneg) /* x < 0.0 */
+ return -1.0F;
+ else
+ return 0.0F;
+ }
+ else
+ {
+ rexp = ((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+ /* Mask out the bits of r that we don't want */
+ mask = (1 << (EXPSHIFTBITS_SP32 - rexp)) - 1;
+ ur = (ux & ~mask);
+ PUT_BITS_SP32(ur, r);
+ if (xneg && (ux != ur))
+ /* We threw some bits away and x was negative */
+ return r - 1.0F;
+ else
+ return r;
+ }
+}
+
+weak_alias (__floorf, floorf)
diff --git a/src/frexp.c b/src/frexp.c
new file mode 100644
index 0000000..0ae109c
--- /dev/null
+++ b/src/frexp.c
@@ -0,0 +1,54 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+double FN_PROTOTYPE(frexp)(double value, int *exp)
+{
+ UT64 val;
+ unsigned int sign;
+ int exponent;
+ val.f64 = value;
+ sign = val.u32[1] & SIGNBIT_SP32;
+ val.u32[1] = val.u32[1] & ~SIGNBIT_SP32; /* remove the sign bit */
+ *exp = 0;
+ if((val.f64 == 0.0) || ((val.u32[1] & 0x7ff00000)== 0x7ff00000))
+ return value; /* value is +-0, NaN or +-Inf: return it unchanged */
+
+ exponent = val.u32[1] >> 20; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f64 = val.f64 * VAL_2PMULTIPLIER_DP;/*multiply by 2^53 to bring it to the normal range*/
+ exponent = val.u32[1] >> 20; /* get the exponent */
+ exponent = exponent - MULTIPLIER_DP;
+ }
+
+ exponent -= 1022; /* remove bias(1023)-1 */
+ *exp = exponent; /* set the integral power of two */
+ val.u32[1] = sign | 0x3fe00000 | (val.u32[1] & 0x000fffff);/* make the fractional part(divide by 2) */
+ return val.f64;
+}
+
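+/* Editorial illustration (not from the original sources): frexp() returns a
+ fraction m with 0.5 <= |m| < 1 and an exponent e such that value == m * 2^e,
+ which ldexp() can reassemble. A quick self-check of the routine above:
+
+ #include <math.h>
+ #include <stdio.h>
+ int main(void)
+ {
+ int e;
+ double m = frexp(12.0, &e); // expect m = 0.75, e = 4
+ printf("%g * 2^%d = %g\n", m, e, ldexp(m, e)); // prints 0.75 * 2^4 = 12
+ return 0;
+ }
+*/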
diff --git a/src/frexpf.c b/src/frexpf.c
new file mode 100644
index 0000000..e2b4ece
--- /dev/null
+++ b/src/frexpf.c
@@ -0,0 +1,55 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+
+float FN_PROTOTYPE(frexpf)(float value, int *exp)
+{
+ UT32 val;
+ unsigned int sign;
+ int exponent;
+ val.f32 = value;
+ sign = val.u32 & SIGNBIT_SP32;
+ val.u32 = val.u32 & ~SIGNBIT_SP32; /* remove the sign bit */
+ *exp = 0;
+ if((val.f32 == 0.0) || ((val.u32 & 0x7f800000)== 0x7f800000))
+ return value; /* value is +-0, NaN or +-Inf: return it unchanged */
+
+ exponent = val.u32 >> 23; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f32 = val.f32 * VAL_2PMULTIPLIER_SP;/*multiply by 2^24 to bring it to the normal range*/
+ exponent = (val.u32 >> 23); /* get the exponent */
+ exponent = exponent - MULTIPLIER_SP;
+ }
+
+ exponent -= 126; /* remove bias(127)-1 */
+ *exp = exponent; /* set the integral power of two */
+ val.u32 = sign | 0x3f000000 | (val.u32 & 0x007fffff);/* make the fractional part(divide by 2) */
+ return val.f32;
+}
+
diff --git a/src/gas/cbrt.S b/src/gas/cbrt.S
new file mode 100644
index 0000000..b733a1a
--- /dev/null
+++ b/src/gas/cbrt.S
@@ -0,0 +1,1575 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# cbrt.S
+#
+# An implementation of the cbrt libm function.
+#
+# Prototype:
+#
+# double cbrt(double x);
+#
+
+#
+# Algorithm:
+#
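+# (Outline reconstructed from the code below; editorial addition. The exact
+# table and polynomial constants live at the end of this file.)
+#
+# 1. Split |x| = 2^(3*q + r) * m with r in {-2,...,2}, and reduce the mantissa
+# against a 256-entry table value F (CBRT_F_H/L), forming f = m - F and
+# the small argument f/F via INV_TAB_256.
+# 2. Evaluate a degree-6 polynomial z approximating cbrt(1 + f/F) - 1.
+# 3. Recombine: cbrt(x) = sign(x) * 2^q * cbrt(2^r) * cbrt(F) * (1 + z),
+# using head/tail splits of cbrt(2^r) and cbrt(F) for extra precision.
+#
+# A rough C restatement of the same idea (illustration only; exp/log stand in
+# for the table + polynomial step):
+#
+# double cbrt_sketch(double x) {
+# int e, q, r;
+# double m = frexp(fabs(x), &e); /* |x| = m * 2^e, 0.5 <= m < 1 */
+# q = e / 3; r = e - 3 * q; /* r in {-2,...,2} */
+# double v = exp(log(m * ldexp(1.0, r)) / 3.0); /* cbrt of the reduced value */
+# return copysign(ldexp(v, q), x);
+# }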
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cbrt)
+#define fname_special _cbrt_special
+
+
+# local variable storage offsets
+
+.equ store_input, -0x10
+.equ stack_size, 0x20
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 32
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ xor %rdx,%rdx
+ #The stack pointer is deliberately left unchanged: this is a leaf procedure,
+ #so skipping the decrement/increment of %rsp saves a few instructions and
+ #helps performance. If a procedure call is ever added, the stack adjustment
+ #must be reinstated.
+ #sub $stack_size, %rsp
+ movd %xmm0,%rax
+ movsd %xmm0,%xmm6
+ mov .L__exp_mask_64(%rip),%r10
+ mov .L__mantissa_mask_64(%rip),%r11
+ mov %rax,%r9
+ and %r10,%rax # rax = stores the exponent
+ and %r11,%r9 # r9 = stores the mantissa
+ shr $52,%rax
+ cmp $0X7FF,%rax
+ jz .L__cbrt_is_Nan_Infinite
+ cmp $0X0,%rax
+ jz .L__cbrt_is_denormal
+
+.align 32
+.L__cbrt_is_normal:
+ mov $3,%rcx # cx is set to 3 to perform division and get the scale and remainder
+ pand .L__sign_bit_64(%rip),%xmm6 # xmm6 contains the sign
+ sub $0x3FF,%ax
+ #the cmp below is redundant since sub already sets the flags, but removing it gave no measurable gain
+ cmp $0,%ax
+ jge .L__donot_change_dx
+ not %dx
+.L__donot_change_dx:
+ idiv %cx #Accumulator dx:ax is divided by cx=3
+ #ax contains the quotient
+ #dx contains the remainder
+ mov %dx,%cx
+ add $0x3FF,%ax
+ shl $52,%rax
+ add $2,%cx
+ shl $1,%cx
+ #ax = Contains the quotient, Scale factor
+ mov %rax,store_input(%rsp)
+ movsd store_input(%rsp),%xmm7 #xmm7 is the scaling factor = mf
+ #xmm0 is the modified input value from the denormal case
+ pand .L__mantissa_mask_64(%rip),%xmm0
+ por .L__zero_point_five(%rip),%xmm0 #xmm0 = Y
+ mov %r9,%r10
+ shr $43,%r10
+ shr $44,%r9
+ and $0x01,%r10
+ or $0x0100,%r9
+ add %r9,%r10 #r10 = index_u64
+ cvtsi2sd %r10,%xmm4 #xmm4 = index_f64
+ sub $256,%r10
+ lea .L__INV_TAB_256(%rip),%rax
+ mulsd .L__one_by_512(%rip), %xmm4 #xmm4 = F
+ subsd %xmm4,%xmm0 # xmm0 = f
+ movsd (%rax,%r10,8),%xmm4
+ mulsd %xmm4,%xmm0 # xmm0 = r
+
+ #Now perform polynomial computation
+
+ # movddup %xmm0,%xmm0 # xmm0 = r ,r
+ shufpd $0,%xmm0,%xmm0 # replacing movddup
+
+ mulsd %xmm0,%xmm0 # xmm0 = r ,r^2
+
+ movapd %xmm0,%xmm4 # xmm4 = r ,r^2
+ movapd %xmm0,%xmm3 # xmm3 = r ,r^2
+ mulpd %xmm0,%xmm0 # xmm0 = r^2,r^4 #########
+ mulpd %xmm0,%xmm3 # xmm3 = r^3,r^6 #########
+ movapd %xmm3,%xmm2
+ mulpd .L__coefficients_3_6(%rip),%xmm2 # xmm2 = [coeff3 * r^3, coeff6 * r^6]
+ mulpd %xmm0,%xmm3 # xmm3 = r^5,r^10 We don't need r^10
+ unpckhpd %xmm3,%xmm4 #xmm4 = r^5,r
+ mulpd .L__coefficients_2_4(%rip),%xmm0 # xmm0 = [coeff2 * r^2, coeff4 * r^4]
+ mulpd .L__coefficients_5_1(%rip),%xmm4 # xmm4 = [coeff5 * r^5, coeff1 * r ]
+ movapd %xmm4,%xmm3
+ unpckhpd %xmm3,%xmm3 #xmm3 = [~Don't Care ,coeff5 * r^5]
+ addsd %xmm3,%xmm2 # xmm2 = [coeff3 * r^3, coeff5 * r^5 + coeff6 * r^6]
+ addpd %xmm2,%xmm0 # xmm0 = [coeff2 * r^2 + coeff3 * r^3,coeff4 * r^4 + coeff5 * r^5 + coeff6 * r^6]
+ movapd %xmm0,%xmm2
+ unpckhpd %xmm2,%xmm2 #xmm3 = [~Don't Care ,coeff2 * r^2 + coeff3 * r^3]
+ addsd %xmm2,%xmm0 # xmm0 = [~Don't Care, coeff2 * r^2 + coeff3 * r^3 + coeff4 * r^4 + coeff5 * r^5 + coeff6 * r^6]
+ addsd %xmm4,%xmm0 # xmm0 = [~Don't Care, coeff1 * r + coeff2 * r^2 + coeff3 * r^3 + coeff4 * r^4 + coeff5 * r^5 + coeff6 * r^6]
+
+ # movddup %xmm0,%xmm0
+ shufpd $0,%xmm0,%xmm0 # replacing movddup
+
+
+ #Polynomial computation completes here
+ #Now compute the following
+ #switch(rem)
+ #{
+ # case -2: cbrtRem_h.u64 = 0x3fe428a2f0000000; cbrtRem_t.u64 = 0x3e531ae515c447bb; break;
+ # case -1: cbrtRem_h.u64 = 0x3fe965fea0000000; cbrtRem_t.u64 = 0x3e44f5b8f20ac166; break;
+ # case 0: cbrtRem_h.u64 = 0x3ff0000000000000; cbrtRem_t.u64 = 0x0000000000000000; break;
+ # case 1: cbrtRem_h.u64 = 0x3ff428a2f0000000; cbrtRem_t.u64 = 0x3e631ae515c447bb; break;
+ # case 2: cbrtRem_h.u64 = 0x3ff965fea0000000; cbrtRem_t.u64 = 0x3e54f5b8f20ac166; break;
+ # default: break;
+ #}
+ #cbrtF_h.u64 = CBRT_F_H[index_u64-256];
+ #cbrtF_t.u64 = CBRT_F_T[index_u64-256];
+ #
+ #bH = (cbrtF_h.f64 * cbrtRem_h.f64);
+ #bT = ((((cbrtF_t.f64 * cbrtRem_t.f64)) + (cbrtF_t.f64 * cbrtRem_h.f64)) + (cbrtRem_t.f64 * cbrtF_h.f64));
+ lea .L__cuberoot_remainder_h_l(%rip),%r8 # load both head and tail of the remainders cuberoot at once
+ movapd (%r8,%rcx,8),%xmm1 # xmm1 = [cbrtRem_h.f64,cbrtRem_t.f64]
+ shl $1,%r10
+ lea .L__CBRT_F_H_L_256(%rip),%rax
+ movapd (%rax,%r10,8),%xmm2 # xmm2 = [cbrtF_h.f64,cbrtF_t.f64]
+ movapd %xmm2,%xmm3
+ psrldq $8,%xmm3 # xmm3 = [~Dont Care,cbrtF_h.f64]
+ unpcklpd %xmm2,%xmm3 # xmm3 = [cbrtF_t.f64,cbrtF_h.f64]
+
+ mulpd %xmm1,%xmm2 # xmm2 = [(cbrtF_h.f64*cbrtRem_h.f64),(cbrtRem_t.f64*cbrtF_t.f64)]
+ mulpd %xmm1,%xmm3 # xmm3 = [(cbrtRem_h.f64*cbrtF_t.f64),(cbrtRem_t.f64*cbrtF_h.f64)]
+ movapd %xmm3,%xmm4
+ unpckhpd %xmm4,%xmm4 # xmm4 = [(cbrtRem_h.f64*cbrtF_t.f64),(cbrtRem_h.f64*cbrtF_t.f64)]
+ addsd %xmm4,%xmm3 # xmm3 = [~Dont Care, ((cbrtRem_h.f64*cbrtF_t.f64) + (cbrtRem_t.f64*cbrtF_h.f64))]
+ addsd %xmm3,%xmm2 # xmm2 = [(cbrtF_h.f64*cbrtRem_h.f64),(((cbrtRem_t.f64*cbrtF_t.f64)+(cbrtRem_h.f64*cbrtF_t.f64) + (cbrtRem_t.f64*cbrtF_h.f64))]
+ # xmm2 = [bH,bT]
+ # Now calculate
+ #ans.f64 = (((((z * bT)) + (bT)) + (z * bH)) + (bH));
+ #ans.f64 = ans.f64 * mf;
+ #ans.u64 = ans.u64 | sign.u64;
+
+ movapd %xmm2,%xmm3
+ unpckhpd %xmm3,%xmm3 # xmm3 = [Dont Care,bH]
+ # also xmm0 = [z,z] = the polynomial which was computed earlier
+ mulpd %xmm2,%xmm0 # xmm0 = [(bH*z),(bT*z)]
+ movapd %xmm0,%xmm4
+ unpckhpd %xmm4,%xmm4 # xmm4 = [(bH*z),(bH*z)]
+ addsd %xmm2,%xmm0 # xmm0 = [~DontCare, ((bT*z) + bT)]
+ unpckhpd %xmm2,%xmm2 # xmm2 = [(bH),(bH)]
+ addsd %xmm4,%xmm0 # xmm0 = [~DontCare, (((bT*z) + bT) + ( z*bH))]
+ addsd %xmm2,%xmm0 # xmm0 = [~DontCare, ((((bT*z) + bT) + (z*bH)) + bH)] = [~Dont Care,ans.f64]
+ mulsd %xmm7,%xmm0 # xmm0 = ans.f64 * mf; mf is the scaling factor
+ por %xmm6,%xmm0 # restore the sign
+ #add $stack_size, %rsp
+ ret
+
+
+.align 32
+.L__cbrt_is_denormal:
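+ #Denormal input: scale it into the normal range by constructing 1.0 with the
+ #denormal mantissa ORed in and subtracting 1.0, which yields mantissa*2^-52 as
+ #a normal number; then recompute the exponent with the extra 1022 adjustment
+ #below and re-enter the normal path.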
+ movsd .L__one_mask_64(%rip),%xmm4
+ cmp $0,%r9
+ jz .L__cbrt_is_zero
+ pand .L__sign_mask_64(%rip),%xmm0
+ por %xmm4,%xmm0
+ subsd %xmm4,%xmm0
+ movd %xmm0,%rax
+ mov %rax,%r9
+ and %r10,%rax # rax = stores the exponent
+ and %r11,%r9 # r9 = stores the mantissa
+ shr $52,%rax
+ sub $1022,%rax
+ jmp .L__cbrt_is_normal
+
+.align 32
+.L__cbrt_is_zero:
+ ret
+.align 32
+.L__cbrt_is_Nan_Infinite:
+ cmp $0,%r9
+ jz .L__cbrt_is_Infinite
+ mulsd %xmm0,%xmm0 #this multiplication will raise an invalid exception
+ por .L__qnan_mask_64(%rip),%xmm0
+.L__cbrt_is_Infinite:
+ #add $stack_size, %rsp
+ ret
+
+.align 32
+.L__mantissa_mask_64: .quad 0x000FFFFFFFFFFFFF
+ .quad 0 #this zero is necessary
+.L__qnan_mask_64: .quad 0x0008000000000000
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0
+.L__zero: .quad 0x0000000000000000
+ .quad 0
+.align 32
+.L__zero_point_five: .quad 0x3FE0000000000000
+ .quad 0
+.align 16
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__sign_bit_64: .quad 0x8000000000000000
+ .quad 0
+.L__one_mask_64: .quad 0x3FF0000000000000
+ .quad 0
+.L__one_by_512: .quad 0x3f60000000000000
+ .quad 0
+
+
+.align 16
+.L__denormal_factor: .quad 0x3F7428A2F98D728B
+ .quad 0
+# The coefficients are arranged in a specific order to aid parallel multiplication.
+# The number beside each coefficient is the power of r by which it is multiplied.
+.L__coefficients:
+.align 32
+.L__coefficients_5_1: .quad 0x3fd5555555555555 # 1
+ .quad 0x3f9ee7113506ac13 # 5
+.L__coefficients_2_4: .quad 0xbfa511e8d2b3183b # 4
+ .quad 0xbfbc71c71c71c71c # 2
+.L__coefficients_3_6: .quad 0xbf98090d6221a247 # 6
+ .quad 0x3faf9add3c0ca458 # 3
+ .quad 0x3f93750ad588f114 # 7
+
+
+
+.align 32
+.L__cuberoot_remainder_h_l:
+ .quad 0x3e531ae515c447bb # cbrt(2^-2) Low
+ .quad 0x3FE428A2F0000000 # cbrt(2^-2) High
+ .quad 0x3e44f5b8f20ac166 # cbrt(2^-1) Low
+ .quad 0x3FE965FEA0000000 # cbrt(2^-1) High
+ .quad 0x0000000000000000 # cbrt(2^0) Low
+ .quad 0x3FF0000000000000 # cbrt(2^0) High
+ .quad 0x3e631ae515c447bb # cbrt(2^1) Low
+ .quad 0x3FF428A2F0000000 # cbrt(2^1) High
+ .quad 0x3e54f5b8f20ac166 # cbrt(2^2) Low
+ .quad 0x3FF965FEA0000000 # cbrt(2^2) High
+
+
+
+#interleaved high and low values
+.align 32
+.L__CBRT_F_H_L_256:
+ .quad 0x0000000000000000
+ .quad 0x3ff0000000000000
+ .quad 0x3e6e6a24c81e4294
+ .quad 0x3ff0055380000000
+ .quad 0x3e58548511e3a785
+ .quad 0x3ff00aa390000000
+ .quad 0x3e64eb9336ec07f6
+ .quad 0x3ff00ff010000000
+ .quad 0x3e40ea64b8b750e1
+ .quad 0x3ff0153920000000
+ .quad 0x3e461637cff8a53c
+ .quad 0x3ff01a7eb0000000
+ .quad 0x3e40733bf7bd1943
+ .quad 0x3ff01fc0d0000000
+ .quad 0x3e5666911345cced
+ .quad 0x3ff024ff80000000
+ .quad 0x3e477b7a3f592f14
+ .quad 0x3ff02a3ad0000000
+ .quad 0x3e6f18d3dd1a5402
+ .quad 0x3ff02f72b0000000
+ .quad 0x3e2be2f5a58ee9a4
+ .quad 0x3ff034a750000000
+ .quad 0x3e68901f8f085fa7
+ .quad 0x3ff039d880000000
+ .quad 0x3e5c68b8cd5b5d69
+ .quad 0x3ff03f0670000000
+ .quad 0x3e5a6b0e8624be42
+ .quad 0x3ff0443110000000
+ .quad 0x3dbc4b22b06f68e7
+ .quad 0x3ff0495870000000
+ .quad 0x3e60f3f0afcabe9b
+ .quad 0x3ff04e7c80000000
+ .quad 0x3e548495bca4e1b7
+ .quad 0x3ff0539d60000000
+ .quad 0x3e66107f1abdfdc3
+ .quad 0x3ff058bb00000000
+ .quad 0x3e6e67261878288a
+ .quad 0x3ff05dd570000000
+ .quad 0x3e5a6bc155286f1e
+ .quad 0x3ff062ecc0000000
+ .quad 0x3e58a759c64a85f2
+ .quad 0x3ff06800e0000000
+ .quad 0x3e45fce70a4a8d09
+ .quad 0x3ff06d11e0000000
+ .quad 0x3e32f9cbf373fe1d
+ .quad 0x3ff0721fc0000000
+ .quad 0x3e590564ce4ac359
+ .quad 0x3ff0772a80000000
+ .quad 0x3e5ac29ce761b02f
+ .quad 0x3ff07c3230000000
+ .quad 0x3e5cb752f497381c
+ .quad 0x3ff08136d0000000
+ .quad 0x3e68bb9e1cfb35e0
+ .quad 0x3ff0863860000000
+ .quad 0x3e65b4917099de90
+ .quad 0x3ff08b36f0000000
+ .quad 0x3e5cc77ac9c65ef2
+ .quad 0x3ff0903280000000
+ .quad 0x3e57a0f3e7be3dba
+ .quad 0x3ff0952b10000000
+ .quad 0x3e66ec851ee0c16f
+ .quad 0x3ff09a20a0000000
+ .quad 0x3e689449bf2946da
+ .quad 0x3ff09f1340000000
+ .quad 0x3e698f25301ba223
+ .quad 0x3ff0a402f0000000
+ .quad 0x3e347d5ec651f549
+ .quad 0x3ff0a8efc0000000
+ .quad 0x3e6c33ec9a86007a
+ .quad 0x3ff0add990000000
+ .quad 0x3e5e0b6653e92649
+ .quad 0x3ff0b2c090000000
+ .quad 0x3e3bd64ac09d755f
+ .quad 0x3ff0b7a4b0000000
+ .quad 0x3e2f537506f78167
+ .quad 0x3ff0bc85f0000000
+ .quad 0x3e62c382d1b3735e
+ .quad 0x3ff0c16450000000
+ .quad 0x3e6e20ed659f99e1
+ .quad 0x3ff0c63fe0000000
+ .quad 0x3e586b633a9c182a
+ .quad 0x3ff0cb18b0000000
+ .quad 0x3e445cfd5a65e777
+ .quad 0x3ff0cfeeb0000000
+ .quad 0x3e60c8770f58bca4
+ .quad 0x3ff0d4c1e0000000
+ .quad 0x3e6739e44b0933c5
+ .quad 0x3ff0d99250000000
+ .quad 0x3e027dc3d9ce7bd8
+ .quad 0x3ff0de6010000000
+ .quad 0x3e63c53c7c5a7b64
+ .quad 0x3ff0e32b00000000
+ .quad 0x3e69669683830cec
+ .quad 0x3ff0e7f340000000
+ .quad 0x3e68d772c39bdcc4
+ .quad 0x3ff0ecb8d0000000
+ .quad 0x3e69b0008bcf6d7b
+ .quad 0x3ff0f17bb0000000
+ .quad 0x3e3bbb305825ce4f
+ .quad 0x3ff0f63bf0000000
+ .quad 0x3e6da3f4af13a406
+ .quad 0x3ff0faf970000000
+ .quad 0x3e5f36b96f74ce86
+ .quad 0x3ff0ffb460000000
+ .quad 0x3e165c002303f790
+ .quad 0x3ff1046cb0000000
+ .quad 0x3e682f84095ba7d5
+ .quad 0x3ff1092250000000
+ .quad 0x3e6d46433541b2c6
+ .quad 0x3ff10dd560000000
+ .quad 0x3e671c3d56e93a89
+ .quad 0x3ff11285e0000000
+ .quad 0x3e598dcef4e40012
+ .quad 0x3ff11733d0000000
+ .quad 0x3e4530ebef17fe03
+ .quad 0x3ff11bdf30000000
+ .quad 0x3e4e8b8fa3715066
+ .quad 0x3ff1208800000000
+ .quad 0x3e6ab26eb3b211dc
+ .quad 0x3ff1252e40000000
+ .quad 0x3e454dd4dc906307
+ .quad 0x3ff129d210000000
+ .quad 0x3e5c9f962387984e
+ .quad 0x3ff12e7350000000
+ .quad 0x3e6c62a959afec09
+ .quad 0x3ff1331210000000
+ .quad 0x3e6638d9ac6a866a
+ .quad 0x3ff137ae60000000
+ .quad 0x3e338704eca8a22d
+ .quad 0x3ff13c4840000000
+ .quad 0x3e4e6c9e1db14f8f
+ .quad 0x3ff140dfa0000000
+ .quad 0x3e58744b7f9c9eaa
+ .quad 0x3ff1457490000000
+ .quad 0x3e66c2893486373b
+ .quad 0x3ff14a0710000000
+ .quad 0x3e5b36bce31699b7
+ .quad 0x3ff14e9730000000
+ .quad 0x3e671e3813d200c7
+ .quad 0x3ff15324e0000000
+ .quad 0x3e699755ab40aa88
+ .quad 0x3ff157b030000000
+ .quad 0x3e6b45ca0e4bcfc0
+ .quad 0x3ff15c3920000000
+ .quad 0x3e32dd090d869c5d
+ .quad 0x3ff160bfc0000000
+ .quad 0x3e64fe0516b917da
+ .quad 0x3ff16543f0000000
+ .quad 0x3e694563226317a2
+ .quad 0x3ff169c5d0000000
+ .quad 0x3e653d8fafc2c851
+ .quad 0x3ff16e4560000000
+ .quad 0x3e5dcbd41fbd41a3
+ .quad 0x3ff172c2a0000000
+ .quad 0x3e5862ff5285f59c
+ .quad 0x3ff1773d90000000
+ .quad 0x3e63072ea97a1e1c
+ .quad 0x3ff17bb630000000
+ .quad 0x3e52839075184805
+ .quad 0x3ff1802c90000000
+ .quad 0x3e64b0323e9eff42
+ .quad 0x3ff184a0a0000000
+ .quad 0x3e6b158893c45484
+ .quad 0x3ff1891270000000
+ .quad 0x3e3149ef0fc35826
+ .quad 0x3ff18d8210000000
+ .quad 0x3e5f2e77ea96acaa
+ .quad 0x3ff191ef60000000
+ .quad 0x3e5200074c471a95
+ .quad 0x3ff1965a80000000
+ .quad 0x3e63f8cc517f6f04
+ .quad 0x3ff19ac360000000
+ .quad 0x3e660ba2e311bb55
+ .quad 0x3ff19f2a10000000
+ .quad 0x3e64b788730bbec3
+ .quad 0x3ff1a38e90000000
+ .quad 0x3e657090795ee20c
+ .quad 0x3ff1a7f0e0000000
+ .quad 0x3e6d9ffe983670b1
+ .quad 0x3ff1ac5100000000
+ .quad 0x3e62a463ff61bfda
+ .quad 0x3ff1b0af00000000
+ .quad 0x3e69d1bc6a5e65cf
+ .quad 0x3ff1b50ad0000000
+ .quad 0x3e68718abaa9e922
+ .quad 0x3ff1b96480000000
+ .quad 0x3e63c2f52ffa342e
+ .quad 0x3ff1bdbc10000000
+ .quad 0x3e60fae13ff42c80
+ .quad 0x3ff1c21180000000
+ .quad 0x3e65440f0ef00d57
+ .quad 0x3ff1c664d0000000
+ .quad 0x3e46fcd22d4e3c1e
+ .quad 0x3ff1cab610000000
+ .quad 0x3e4e0c60b409e863
+ .quad 0x3ff1cf0530000000
+ .quad 0x3e6f9cab5a5f0333
+ .quad 0x3ff1d35230000000
+ .quad 0x3e630f24744c333d
+ .quad 0x3ff1d79d30000000
+ .quad 0x3e4b50622a76b2fe
+ .quad 0x3ff1dbe620000000
+ .quad 0x3e6fdb94ba595375
+ .quad 0x3ff1e02cf0000000
+ .quad 0x3e3861b9b945a171
+ .quad 0x3ff1e471d0000000
+ .quad 0x3e654348015188c4
+ .quad 0x3ff1e8b490000000
+ .quad 0x3e6b54d149865523
+ .quad 0x3ff1ecf550000000
+ .quad 0x3e6a0bb783d9de33
+ .quad 0x3ff1f13410000000
+ .quad 0x3e6629d12b1a2157
+ .quad 0x3ff1f570d0000000
+ .quad 0x3e6467fe35d179df
+ .quad 0x3ff1f9ab90000000
+ .quad 0x3e69763f3e26c8f7
+ .quad 0x3ff1fde450000000
+ .quad 0x3e53f798bb9f7679
+ .quad 0x3ff2021b20000000
+ .quad 0x3e552e577e855898
+ .quad 0x3ff2064ff0000000
+ .quad 0x3e6fde47e5502c3a
+ .quad 0x3ff20a82c0000000
+ .quad 0x3e5cbd0b548d96a0
+ .quad 0x3ff20eb3b0000000
+ .quad 0x3e6a9cd9f7be8de8
+ .quad 0x3ff212e2a0000000
+ .quad 0x3e522bbe704886de
+ .quad 0x3ff2170fb0000000
+ .quad 0x3e6e3dea8317f020
+ .quad 0x3ff21b3ac0000000
+ .quad 0x3e6e812085ac8855
+ .quad 0x3ff21f63f0000000
+ .quad 0x3e5c87144f24cb07
+ .quad 0x3ff2238b40000000
+ .quad 0x3e61e128ee311fa2
+ .quad 0x3ff227b0a0000000
+ .quad 0x3e5b5c163d61a2d3
+ .quad 0x3ff22bd420000000
+ .quad 0x3e47d97e7fb90633
+ .quad 0x3ff22ff5c0000000
+ .quad 0x3e6efe899d50f6a7
+ .quad 0x3ff2341570000000
+ .quad 0x3e6d0333eb75de5a
+ .quad 0x3ff2383350000000
+ .quad 0x3e40e590be73a573
+ .quad 0x3ff23c4f60000000
+ .quad 0x3e68ce8dcac3cdd2
+ .quad 0x3ff2406980000000
+ .quad 0x3e6ee8a48954064b
+ .quad 0x3ff24481d0000000
+ .quad 0x3e6aa62f18461e09
+ .quad 0x3ff2489850000000
+ .quad 0x3e601e5940986a15
+ .quad 0x3ff24cad00000000
+ .quad 0x3e3b082f4f9b8d4c
+ .quad 0x3ff250bfe0000000
+ .quad 0x3e6876e0e5527f5a
+ .quad 0x3ff254d0e0000000
+ .quad 0x3e63617080831e6b
+ .quad 0x3ff258e020000000
+ .quad 0x3e681b26e34aa4a2
+ .quad 0x3ff25ced90000000
+ .quad 0x3e552ee66dfab0c1
+ .quad 0x3ff260f940000000
+ .quad 0x3e5d85a5329e8819
+ .quad 0x3ff2650320000000
+ .quad 0x3e5105c1b646b5d1
+ .quad 0x3ff2690b40000000
+ .quad 0x3e6bb6690c1a379c
+ .quad 0x3ff26d1190000000
+ .quad 0x3e586aeba73ce3a9
+ .quad 0x3ff2711630000000
+ .quad 0x3e6dd16198294dd4
+ .quad 0x3ff2751900000000
+ .quad 0x3e6454e675775e83
+ .quad 0x3ff2791a20000000
+ .quad 0x3e63842e026197ea
+ .quad 0x3ff27d1980000000
+ .quad 0x3e6f1ce0e70c44d2
+ .quad 0x3ff2811720000000
+ .quad 0x3e6ad636441a5627
+ .quad 0x3ff2851310000000
+ .quad 0x3e54c205d7212abb
+ .quad 0x3ff2890d50000000
+ .quad 0x3e6167c86c116419
+ .quad 0x3ff28d05d0000000
+ .quad 0x3e638ec3ef16e294
+ .quad 0x3ff290fca0000000
+ .quad 0x3e6473fceace9321
+ .quad 0x3ff294f1c0000000
+ .quad 0x3e67af53a836dba7
+ .quad 0x3ff298e530000000
+ .quad 0x3e1a51f3c383b652
+ .quad 0x3ff29cd700000000
+ .quad 0x3e63696da190822d
+ .quad 0x3ff2a0c710000000
+ .quad 0x3e62f9adec77074b
+ .quad 0x3ff2a4b580000000
+ .quad 0x3e38190fd5bee55f
+ .quad 0x3ff2a8a250000000
+ .quad 0x3e4bfee8fac68e55
+ .quad 0x3ff2ac8d70000000
+ .quad 0x3e331c9d6bc5f68a
+ .quad 0x3ff2b076f0000000
+ .quad 0x3e689d0523737edf
+ .quad 0x3ff2b45ec0000000
+ .quad 0x3e5a295943bf47bb
+ .quad 0x3ff2b84500000000
+ .quad 0x3e396be32e5b3207
+ .quad 0x3ff2bc29a0000000
+ .quad 0x3e6e44c7d909fa0e
+ .quad 0x3ff2c00c90000000
+ .quad 0x3e2b2505da94d9ea
+ .quad 0x3ff2c3ee00000000
+ .quad 0x3e60c851f46c9c98
+ .quad 0x3ff2c7cdc0000000
+ .quad 0x3e5da71f7d9aa3b7
+ .quad 0x3ff2cbabf0000000
+ .quad 0x3e6f1b605d019ef1
+ .quad 0x3ff2cf8880000000
+ .quad 0x3e4386e8a2189563
+ .quad 0x3ff2d36390000000
+ .quad 0x3e3b19fa5d306ba7
+ .quad 0x3ff2d73d00000000
+ .quad 0x3e6dd749b67aef76
+ .quad 0x3ff2db14d0000000
+ .quad 0x3e676ff6f1dc04b0
+ .quad 0x3ff2deeb20000000
+ .quad 0x3e635a33d0b232a6
+ .quad 0x3ff2e2bfe0000000
+ .quad 0x3e64bdc80024a4e1
+ .quad 0x3ff2e69310000000
+ .quad 0x3e6ebd61770fd723
+ .quad 0x3ff2ea64b0000000
+ .quad 0x3e64769fc537264d
+ .quad 0x3ff2ee34d0000000
+ .quad 0x3e69021f429f3b98
+ .quad 0x3ff2f20360000000
+ .quad 0x3e5ee7083efbd606
+ .quad 0x3ff2f5d070000000
+ .quad 0x3e6ad985552a6b1a
+ .quad 0x3ff2f99bf0000000
+ .quad 0x3e6e3df778772160
+ .quad 0x3ff2fd65f0000000
+ .quad 0x3e6ca5d76ddc9b34
+ .quad 0x3ff3012e70000000
+ .quad 0x3e691154ffdbaf74
+ .quad 0x3ff304f570000000
+ .quad 0x3e667bdd57fb306a
+ .quad 0x3ff308baf0000000
+ .quad 0x3e67dc255ac40886
+ .quad 0x3ff30c7ef0000000
+ .quad 0x3df219f38e8afafe
+ .quad 0x3ff3104180000000
+ .quad 0x3e62416bf9669a04
+ .quad 0x3ff3140280000000
+ .quad 0x3e611c96b2b3987f
+ .quad 0x3ff317c210000000
+ .quad 0x3e6f99ed447e1177
+ .quad 0x3ff31b8020000000
+ .quad 0x3e13245826328a11
+ .quad 0x3ff31f3cd0000000
+ .quad 0x3e66f56dd1e645f8
+ .quad 0x3ff322f7f0000000
+ .quad 0x3e46164946945535
+ .quad 0x3ff326b1b0000000
+ .quad 0x3e5e37d59d190028
+ .quad 0x3ff32a69f0000000
+ .quad 0x3e668671f12bf828
+ .quad 0x3ff32e20c0000000
+ .quad 0x3e6e8ecbca6aabbd
+ .quad 0x3ff331d620000000
+ .quad 0x3e53f49e109a5912
+ .quad 0x3ff3358a20000000
+ .quad 0x3e6b8a0e11ec3043
+ .quad 0x3ff3393ca0000000
+ .quad 0x3e65fae00aed691a
+ .quad 0x3ff33cedc0000000
+ .quad 0x3e6c0569bece3e4a
+ .quad 0x3ff3409d70000000
+ .quad 0x3e605e26744efbfe
+ .quad 0x3ff3444bc0000000
+ .quad 0x3e65b570a94be5c5
+ .quad 0x3ff347f8a0000000
+ .quad 0x3e5d6f156ea0e063
+ .quad 0x3ff34ba420000000
+ .quad 0x3e6e0ca7612fc484
+ .quad 0x3ff34f4e30000000
+ .quad 0x3e4963c927b25258
+ .quad 0x3ff352f6f0000000
+ .quad 0x3e547930aa725a5c
+ .quad 0x3ff3569e40000000
+ .quad 0x3e58a79fe3af43b3
+ .quad 0x3ff35a4430000000
+ .quad 0x3e5e6dc29c41bdaf
+ .quad 0x3ff35de8c0000000
+ .quad 0x3e657a2e76f863a5
+ .quad 0x3ff3618bf0000000
+ .quad 0x3e2ae3b61716354d
+ .quad 0x3ff3652dd0000000
+ .quad 0x3e665fb5df6906b1
+ .quad 0x3ff368ce40000000
+ .quad 0x3e66177d7f588f7b
+ .quad 0x3ff36c6d60000000
+ .quad 0x3e3ad55abd091b67
+ .quad 0x3ff3700b30000000
+ .quad 0x3e155337b2422d76
+ .quad 0x3ff373a7a0000000
+ .quad 0x3e6084ebe86972d5
+ .quad 0x3ff37742b0000000
+ .quad 0x3e656395808e1ea3
+ .quad 0x3ff37adc70000000
+ .quad 0x3e61bce21b40fba7
+ .quad 0x3ff37e74e0000000
+ .quad 0x3e5006f94605b515
+ .quad 0x3ff3820c00000000
+ .quad 0x3e6aa676aceb1f7d
+ .quad 0x3ff385a1c0000000
+ .quad 0x3e58229f76554ce6
+ .quad 0x3ff3893640000000
+ .quad 0x3e6eabfc6cf57330
+ .quad 0x3ff38cc960000000
+ .quad 0x3e64daed9c0ce8bc
+ .quad 0x3ff3905b40000000
+ .quad 0x3e60ff1768237141
+ .quad 0x3ff393ebd0000000
+ .quad 0x3e6575f83051b085
+ .quad 0x3ff3977b10000000
+ .quad 0x3e42667deb523e29
+ .quad 0x3ff39b0910000000
+ .quad 0x3e1816996954f4fd
+ .quad 0x3ff39e95c0000000
+ .quad 0x3e587cfccf4d9cd4
+ .quad 0x3ff3a22120000000
+ .quad 0x3e52c5d018198353
+ .quad 0x3ff3a5ab40000000
+ .quad 0x3e6a7a898dcc34aa
+ .quad 0x3ff3a93410000000
+ .quad 0x3e2cead6dadc36d1
+ .quad 0x3ff3acbbb0000000
+ .quad 0x3e2a55759c498bdf
+ .quad 0x3ff3b04200000000
+ .quad 0x3e6c414a9ef6de04
+ .quad 0x3ff3b3c700000000
+ .quad 0x3e63e2108a6e58fa
+ .quad 0x3ff3b74ad0000000
+ .quad 0x3e5587fd7643d77c
+ .quad 0x3ff3bacd60000000
+ .quad 0x3e3901eb1d3ff3df
+ .quad 0x3ff3be4eb0000000
+ .quad 0x3e6f2ccd7c812fc6
+ .quad 0x3ff3c1ceb0000000
+ .quad 0x3e21c8ee70a01049
+ .quad 0x3ff3c54d90000000
+ .quad 0x3e563e8d02831eec
+ .quad 0x3ff3c8cb20000000
+ .quad 0x3e6f61a42a92c7ff
+ .quad 0x3ff3cc4770000000
+ .quad 0x3dda917399c84d24
+ .quad 0x3ff3cfc2a0000000
+ .quad 0x3e5e9197c8eec2f0
+ .quad 0x3ff3d33c80000000
+ .quad 0x3e5e6f842f5a1378
+ .quad 0x3ff3d6b530000000
+ .quad 0x3e2fac242a90a0fc
+ .quad 0x3ff3da2cb0000000
+ .quad 0x3e535ed726610227
+ .quad 0x3ff3dda2f0000000
+ .quad 0x3e50e0d64804b15b
+ .quad 0x3ff3e11800000000
+ .quad 0x3e0560675daba814
+ .quad 0x3ff3e48be0000000
+ .quad 0x3e637388c8768032
+ .quad 0x3ff3e7fe80000000
+ .quad 0x3e3ee3c89f9e01f5
+ .quad 0x3ff3eb7000000000
+ .quad 0x3e639f6f0d09747c
+ .quad 0x3ff3eee040000000
+ .quad 0x3e4322c327abb8f0
+ .quad 0x3ff3f24f60000000
+ .quad 0x3e6961b347c8ac80
+ .quad 0x3ff3f5bd40000000
+ .quad 0x3e63711fbbd0f118
+ .quad 0x3ff3f92a00000000
+ .quad 0x3e64fad8d7718ffb
+ .quad 0x3ff3fc9590000000
+ .quad 0x3e6fffffffffffff
+ .quad 0x3ff3fffff0000000
+ .quad 0x3e667efa79ec35b4
+ .quad 0x3ff4036930000000
+ .quad 0x3e6a737687a254a8
+ .quad 0x3ff406d140000000
+ .quad 0x3e5bace0f87d924d
+ .quad 0x3ff40a3830000000
+ .quad 0x3e629e37c237e392
+ .quad 0x3ff40d9df0000000
+ .quad 0x3e557ce7ac3f3012
+ .quad 0x3ff4110290000000
+ .quad 0x3e682829359f8fbd
+ .quad 0x3ff4146600000000
+ .quad 0x3e6cc9be42d14676
+ .quad 0x3ff417c850000000
+ .quad 0x3e6a8f001c137d0b
+ .quad 0x3ff41b2980000000
+ .quad 0x3e636127687dda05
+ .quad 0x3ff41e8990000000
+ .quad 0x3e524dba322646f0
+ .quad 0x3ff421e880000000
+ .quad 0x3e6dc43f1ed210b4
+ .quad 0x3ff4254640000000
+ .quad 0x3e631ae515c447bb
+ .quad 0x3ff428a2f0000000
+
+
+.align 32
+.L__CBRT_F_H_256: .quad 0x3ff0000000000000
+ .quad 0x3ff0055380000000
+ .quad 0x3ff00aa390000000
+ .quad 0x3ff00ff010000000
+ .quad 0x3ff0153920000000
+ .quad 0x3ff01a7eb0000000
+ .quad 0x3ff01fc0d0000000
+ .quad 0x3ff024ff80000000
+ .quad 0x3ff02a3ad0000000
+ .quad 0x3ff02f72b0000000
+ .quad 0x3ff034a750000000
+ .quad 0x3ff039d880000000
+ .quad 0x3ff03f0670000000
+ .quad 0x3ff0443110000000
+ .quad 0x3ff0495870000000
+ .quad 0x3ff04e7c80000000
+ .quad 0x3ff0539d60000000
+ .quad 0x3ff058bb00000000
+ .quad 0x3ff05dd570000000
+ .quad 0x3ff062ecc0000000
+ .quad 0x3ff06800e0000000
+ .quad 0x3ff06d11e0000000
+ .quad 0x3ff0721fc0000000
+ .quad 0x3ff0772a80000000
+ .quad 0x3ff07c3230000000
+ .quad 0x3ff08136d0000000
+ .quad 0x3ff0863860000000
+ .quad 0x3ff08b36f0000000
+ .quad 0x3ff0903280000000
+ .quad 0x3ff0952b10000000
+ .quad 0x3ff09a20a0000000
+ .quad 0x3ff09f1340000000
+ .quad 0x3ff0a402f0000000
+ .quad 0x3ff0a8efc0000000
+ .quad 0x3ff0add990000000
+ .quad 0x3ff0b2c090000000
+ .quad 0x3ff0b7a4b0000000
+ .quad 0x3ff0bc85f0000000
+ .quad 0x3ff0c16450000000
+ .quad 0x3ff0c63fe0000000
+ .quad 0x3ff0cb18b0000000
+ .quad 0x3ff0cfeeb0000000
+ .quad 0x3ff0d4c1e0000000
+ .quad 0x3ff0d99250000000
+ .quad 0x3ff0de6010000000
+ .quad 0x3ff0e32b00000000
+ .quad 0x3ff0e7f340000000
+ .quad 0x3ff0ecb8d0000000
+ .quad 0x3ff0f17bb0000000
+ .quad 0x3ff0f63bf0000000
+ .quad 0x3ff0faf970000000
+ .quad 0x3ff0ffb460000000
+ .quad 0x3ff1046cb0000000
+ .quad 0x3ff1092250000000
+ .quad 0x3ff10dd560000000
+ .quad 0x3ff11285e0000000
+ .quad 0x3ff11733d0000000
+ .quad 0x3ff11bdf30000000
+ .quad 0x3ff1208800000000
+ .quad 0x3ff1252e40000000
+ .quad 0x3ff129d210000000
+ .quad 0x3ff12e7350000000
+ .quad 0x3ff1331210000000
+ .quad 0x3ff137ae60000000
+ .quad 0x3ff13c4840000000
+ .quad 0x3ff140dfa0000000
+ .quad 0x3ff1457490000000
+ .quad 0x3ff14a0710000000
+ .quad 0x3ff14e9730000000
+ .quad 0x3ff15324e0000000
+ .quad 0x3ff157b030000000
+ .quad 0x3ff15c3920000000
+ .quad 0x3ff160bfc0000000
+ .quad 0x3ff16543f0000000
+ .quad 0x3ff169c5d0000000
+ .quad 0x3ff16e4560000000
+ .quad 0x3ff172c2a0000000
+ .quad 0x3ff1773d90000000
+ .quad 0x3ff17bb630000000
+ .quad 0x3ff1802c90000000
+ .quad 0x3ff184a0a0000000
+ .quad 0x3ff1891270000000
+ .quad 0x3ff18d8210000000
+ .quad 0x3ff191ef60000000
+ .quad 0x3ff1965a80000000
+ .quad 0x3ff19ac360000000
+ .quad 0x3ff19f2a10000000
+ .quad 0x3ff1a38e90000000
+ .quad 0x3ff1a7f0e0000000
+ .quad 0x3ff1ac5100000000
+ .quad 0x3ff1b0af00000000
+ .quad 0x3ff1b50ad0000000
+ .quad 0x3ff1b96480000000
+ .quad 0x3ff1bdbc10000000
+ .quad 0x3ff1c21180000000
+ .quad 0x3ff1c664d0000000
+ .quad 0x3ff1cab610000000
+ .quad 0x3ff1cf0530000000
+ .quad 0x3ff1d35230000000
+ .quad 0x3ff1d79d30000000
+ .quad 0x3ff1dbe620000000
+ .quad 0x3ff1e02cf0000000
+ .quad 0x3ff1e471d0000000
+ .quad 0x3ff1e8b490000000
+ .quad 0x3ff1ecf550000000
+ .quad 0x3ff1f13410000000
+ .quad 0x3ff1f570d0000000
+ .quad 0x3ff1f9ab90000000
+ .quad 0x3ff1fde450000000
+ .quad 0x3ff2021b20000000
+ .quad 0x3ff2064ff0000000
+ .quad 0x3ff20a82c0000000
+ .quad 0x3ff20eb3b0000000
+ .quad 0x3ff212e2a0000000
+ .quad 0x3ff2170fb0000000
+ .quad 0x3ff21b3ac0000000
+ .quad 0x3ff21f63f0000000
+ .quad 0x3ff2238b40000000
+ .quad 0x3ff227b0a0000000
+ .quad 0x3ff22bd420000000
+ .quad 0x3ff22ff5c0000000
+ .quad 0x3ff2341570000000
+ .quad 0x3ff2383350000000
+ .quad 0x3ff23c4f60000000
+ .quad 0x3ff2406980000000
+ .quad 0x3ff24481d0000000
+ .quad 0x3ff2489850000000
+ .quad 0x3ff24cad00000000
+ .quad 0x3ff250bfe0000000
+ .quad 0x3ff254d0e0000000
+ .quad 0x3ff258e020000000
+ .quad 0x3ff25ced90000000
+ .quad 0x3ff260f940000000
+ .quad 0x3ff2650320000000
+ .quad 0x3ff2690b40000000
+ .quad 0x3ff26d1190000000
+ .quad 0x3ff2711630000000
+ .quad 0x3ff2751900000000
+ .quad 0x3ff2791a20000000
+ .quad 0x3ff27d1980000000
+ .quad 0x3ff2811720000000
+ .quad 0x3ff2851310000000
+ .quad 0x3ff2890d50000000
+ .quad 0x3ff28d05d0000000
+ .quad 0x3ff290fca0000000
+ .quad 0x3ff294f1c0000000
+ .quad 0x3ff298e530000000
+ .quad 0x3ff29cd700000000
+ .quad 0x3ff2a0c710000000
+ .quad 0x3ff2a4b580000000
+ .quad 0x3ff2a8a250000000
+ .quad 0x3ff2ac8d70000000
+ .quad 0x3ff2b076f0000000
+ .quad 0x3ff2b45ec0000000
+ .quad 0x3ff2b84500000000
+ .quad 0x3ff2bc29a0000000
+ .quad 0x3ff2c00c90000000
+ .quad 0x3ff2c3ee00000000
+ .quad 0x3ff2c7cdc0000000
+ .quad 0x3ff2cbabf0000000
+ .quad 0x3ff2cf8880000000
+ .quad 0x3ff2d36390000000
+ .quad 0x3ff2d73d00000000
+ .quad 0x3ff2db14d0000000
+ .quad 0x3ff2deeb20000000
+ .quad 0x3ff2e2bfe0000000
+ .quad 0x3ff2e69310000000
+ .quad 0x3ff2ea64b0000000
+ .quad 0x3ff2ee34d0000000
+ .quad 0x3ff2f20360000000
+ .quad 0x3ff2f5d070000000
+ .quad 0x3ff2f99bf0000000
+ .quad 0x3ff2fd65f0000000
+ .quad 0x3ff3012e70000000
+ .quad 0x3ff304f570000000
+ .quad 0x3ff308baf0000000
+ .quad 0x3ff30c7ef0000000
+ .quad 0x3ff3104180000000
+ .quad 0x3ff3140280000000
+ .quad 0x3ff317c210000000
+ .quad 0x3ff31b8020000000
+ .quad 0x3ff31f3cd0000000
+ .quad 0x3ff322f7f0000000
+ .quad 0x3ff326b1b0000000
+ .quad 0x3ff32a69f0000000
+ .quad 0x3ff32e20c0000000
+ .quad 0x3ff331d620000000
+ .quad 0x3ff3358a20000000
+ .quad 0x3ff3393ca0000000
+ .quad 0x3ff33cedc0000000
+ .quad 0x3ff3409d70000000
+ .quad 0x3ff3444bc0000000
+ .quad 0x3ff347f8a0000000
+ .quad 0x3ff34ba420000000
+ .quad 0x3ff34f4e30000000
+ .quad 0x3ff352f6f0000000
+ .quad 0x3ff3569e40000000
+ .quad 0x3ff35a4430000000
+ .quad 0x3ff35de8c0000000
+ .quad 0x3ff3618bf0000000
+ .quad 0x3ff3652dd0000000
+ .quad 0x3ff368ce40000000
+ .quad 0x3ff36c6d60000000
+ .quad 0x3ff3700b30000000
+ .quad 0x3ff373a7a0000000
+ .quad 0x3ff37742b0000000
+ .quad 0x3ff37adc70000000
+ .quad 0x3ff37e74e0000000
+ .quad 0x3ff3820c00000000
+ .quad 0x3ff385a1c0000000
+ .quad 0x3ff3893640000000
+ .quad 0x3ff38cc960000000
+ .quad 0x3ff3905b40000000
+ .quad 0x3ff393ebd0000000
+ .quad 0x3ff3977b10000000
+ .quad 0x3ff39b0910000000
+ .quad 0x3ff39e95c0000000
+ .quad 0x3ff3a22120000000
+ .quad 0x3ff3a5ab40000000
+ .quad 0x3ff3a93410000000
+ .quad 0x3ff3acbbb0000000
+ .quad 0x3ff3b04200000000
+ .quad 0x3ff3b3c700000000
+ .quad 0x3ff3b74ad0000000
+ .quad 0x3ff3bacd60000000
+ .quad 0x3ff3be4eb0000000
+ .quad 0x3ff3c1ceb0000000
+ .quad 0x3ff3c54d90000000
+ .quad 0x3ff3c8cb20000000
+ .quad 0x3ff3cc4770000000
+ .quad 0x3ff3cfc2a0000000
+ .quad 0x3ff3d33c80000000
+ .quad 0x3ff3d6b530000000
+ .quad 0x3ff3da2cb0000000
+ .quad 0x3ff3dda2f0000000
+ .quad 0x3ff3e11800000000
+ .quad 0x3ff3e48be0000000
+ .quad 0x3ff3e7fe80000000
+ .quad 0x3ff3eb7000000000
+ .quad 0x3ff3eee040000000
+ .quad 0x3ff3f24f60000000
+ .quad 0x3ff3f5bd40000000
+ .quad 0x3ff3f92a00000000
+ .quad 0x3ff3fc9590000000
+ .quad 0x3ff3fffff0000000
+ .quad 0x3ff4036930000000
+ .quad 0x3ff406d140000000
+ .quad 0x3ff40a3830000000
+ .quad 0x3ff40d9df0000000
+ .quad 0x3ff4110290000000
+ .quad 0x3ff4146600000000
+ .quad 0x3ff417c850000000
+ .quad 0x3ff41b2980000000
+ .quad 0x3ff41e8990000000
+ .quad 0x3ff421e880000000
+ .quad 0x3ff4254640000000
+
+.align 32
+.L__CBRT_F_T_256: .quad 0x0000000000000000
+ .quad 0x3e6e6a24c81e4294
+ .quad 0x3e58548511e3a785
+ .quad 0x3e64eb9336ec07f6
+ .quad 0x3e40ea64b8b750e1
+ .quad 0x3e461637cff8a53c
+ .quad 0x3e40733bf7bd1943
+ .quad 0x3e5666911345cced
+ .quad 0x3e477b7a3f592f14
+ .quad 0x3e6f18d3dd1a5402
+ .quad 0x3e2be2f5a58ee9a4
+ .quad 0x3e68901f8f085fa7
+ .quad 0x3e5c68b8cd5b5d69
+ .quad 0x3e5a6b0e8624be42
+ .quad 0x3dbc4b22b06f68e7
+ .quad 0x3e60f3f0afcabe9b
+ .quad 0x3e548495bca4e1b7
+ .quad 0x3e66107f1abdfdc3
+ .quad 0x3e6e67261878288a
+ .quad 0x3e5a6bc155286f1e
+ .quad 0x3e58a759c64a85f2
+ .quad 0x3e45fce70a4a8d09
+ .quad 0x3e32f9cbf373fe1d
+ .quad 0x3e590564ce4ac359
+ .quad 0x3e5ac29ce761b02f
+ .quad 0x3e5cb752f497381c
+ .quad 0x3e68bb9e1cfb35e0
+ .quad 0x3e65b4917099de90
+ .quad 0x3e5cc77ac9c65ef2
+ .quad 0x3e57a0f3e7be3dba
+ .quad 0x3e66ec851ee0c16f
+ .quad 0x3e689449bf2946da
+ .quad 0x3e698f25301ba223
+ .quad 0x3e347d5ec651f549
+ .quad 0x3e6c33ec9a86007a
+ .quad 0x3e5e0b6653e92649
+ .quad 0x3e3bd64ac09d755f
+ .quad 0x3e2f537506f78167
+ .quad 0x3e62c382d1b3735e
+ .quad 0x3e6e20ed659f99e1
+ .quad 0x3e586b633a9c182a
+ .quad 0x3e445cfd5a65e777
+ .quad 0x3e60c8770f58bca4
+ .quad 0x3e6739e44b0933c5
+ .quad 0x3e027dc3d9ce7bd8
+ .quad 0x3e63c53c7c5a7b64
+ .quad 0x3e69669683830cec
+ .quad 0x3e68d772c39bdcc4
+ .quad 0x3e69b0008bcf6d7b
+ .quad 0x3e3bbb305825ce4f
+ .quad 0x3e6da3f4af13a406
+ .quad 0x3e5f36b96f74ce86
+ .quad 0x3e165c002303f790
+ .quad 0x3e682f84095ba7d5
+ .quad 0x3e6d46433541b2c6
+ .quad 0x3e671c3d56e93a89
+ .quad 0x3e598dcef4e40012
+ .quad 0x3e4530ebef17fe03
+ .quad 0x3e4e8b8fa3715066
+ .quad 0x3e6ab26eb3b211dc
+ .quad 0x3e454dd4dc906307
+ .quad 0x3e5c9f962387984e
+ .quad 0x3e6c62a959afec09
+ .quad 0x3e6638d9ac6a866a
+ .quad 0x3e338704eca8a22d
+ .quad 0x3e4e6c9e1db14f8f
+ .quad 0x3e58744b7f9c9eaa
+ .quad 0x3e66c2893486373b
+ .quad 0x3e5b36bce31699b7
+ .quad 0x3e671e3813d200c7
+ .quad 0x3e699755ab40aa88
+ .quad 0x3e6b45ca0e4bcfc0
+ .quad 0x3e32dd090d869c5d
+ .quad 0x3e64fe0516b917da
+ .quad 0x3e694563226317a2
+ .quad 0x3e653d8fafc2c851
+ .quad 0x3e5dcbd41fbd41a3
+ .quad 0x3e5862ff5285f59c
+ .quad 0x3e63072ea97a1e1c
+ .quad 0x3e52839075184805
+ .quad 0x3e64b0323e9eff42
+ .quad 0x3e6b158893c45484
+ .quad 0x3e3149ef0fc35826
+ .quad 0x3e5f2e77ea96acaa
+ .quad 0x3e5200074c471a95
+ .quad 0x3e63f8cc517f6f04
+ .quad 0x3e660ba2e311bb55
+ .quad 0x3e64b788730bbec3
+ .quad 0x3e657090795ee20c
+ .quad 0x3e6d9ffe983670b1
+ .quad 0x3e62a463ff61bfda
+ .quad 0x3e69d1bc6a5e65cf
+ .quad 0x3e68718abaa9e922
+ .quad 0x3e63c2f52ffa342e
+ .quad 0x3e60fae13ff42c80
+ .quad 0x3e65440f0ef00d57
+ .quad 0x3e46fcd22d4e3c1e
+ .quad 0x3e4e0c60b409e863
+ .quad 0x3e6f9cab5a5f0333
+ .quad 0x3e630f24744c333d
+ .quad 0x3e4b50622a76b2fe
+ .quad 0x3e6fdb94ba595375
+ .quad 0x3e3861b9b945a171
+ .quad 0x3e654348015188c4
+ .quad 0x3e6b54d149865523
+ .quad 0x3e6a0bb783d9de33
+ .quad 0x3e6629d12b1a2157
+ .quad 0x3e6467fe35d179df
+ .quad 0x3e69763f3e26c8f7
+ .quad 0x3e53f798bb9f7679
+ .quad 0x3e552e577e855898
+ .quad 0x3e6fde47e5502c3a
+ .quad 0x3e5cbd0b548d96a0
+ .quad 0x3e6a9cd9f7be8de8
+ .quad 0x3e522bbe704886de
+ .quad 0x3e6e3dea8317f020
+ .quad 0x3e6e812085ac8855
+ .quad 0x3e5c87144f24cb07
+ .quad 0x3e61e128ee311fa2
+ .quad 0x3e5b5c163d61a2d3
+ .quad 0x3e47d97e7fb90633
+ .quad 0x3e6efe899d50f6a7
+ .quad 0x3e6d0333eb75de5a
+ .quad 0x3e40e590be73a573
+ .quad 0x3e68ce8dcac3cdd2
+ .quad 0x3e6ee8a48954064b
+ .quad 0x3e6aa62f18461e09
+ .quad 0x3e601e5940986a15
+ .quad 0x3e3b082f4f9b8d4c
+ .quad 0x3e6876e0e5527f5a
+ .quad 0x3e63617080831e6b
+ .quad 0x3e681b26e34aa4a2
+ .quad 0x3e552ee66dfab0c1
+ .quad 0x3e5d85a5329e8819
+ .quad 0x3e5105c1b646b5d1
+ .quad 0x3e6bb6690c1a379c
+ .quad 0x3e586aeba73ce3a9
+ .quad 0x3e6dd16198294dd4
+ .quad 0x3e6454e675775e83
+ .quad 0x3e63842e026197ea
+ .quad 0x3e6f1ce0e70c44d2
+ .quad 0x3e6ad636441a5627
+ .quad 0x3e54c205d7212abb
+ .quad 0x3e6167c86c116419
+ .quad 0x3e638ec3ef16e294
+ .quad 0x3e6473fceace9321
+ .quad 0x3e67af53a836dba7
+ .quad 0x3e1a51f3c383b652
+ .quad 0x3e63696da190822d
+ .quad 0x3e62f9adec77074b
+ .quad 0x3e38190fd5bee55f
+ .quad 0x3e4bfee8fac68e55
+ .quad 0x3e331c9d6bc5f68a
+ .quad 0x3e689d0523737edf
+ .quad 0x3e5a295943bf47bb
+ .quad 0x3e396be32e5b3207
+ .quad 0x3e6e44c7d909fa0e
+ .quad 0x3e2b2505da94d9ea
+ .quad 0x3e60c851f46c9c98
+ .quad 0x3e5da71f7d9aa3b7
+ .quad 0x3e6f1b605d019ef1
+ .quad 0x3e4386e8a2189563
+ .quad 0x3e3b19fa5d306ba7
+ .quad 0x3e6dd749b67aef76
+ .quad 0x3e676ff6f1dc04b0
+ .quad 0x3e635a33d0b232a6
+ .quad 0x3e64bdc80024a4e1
+ .quad 0x3e6ebd61770fd723
+ .quad 0x3e64769fc537264d
+ .quad 0x3e69021f429f3b98
+ .quad 0x3e5ee7083efbd606
+ .quad 0x3e6ad985552a6b1a
+ .quad 0x3e6e3df778772160
+ .quad 0x3e6ca5d76ddc9b34
+ .quad 0x3e691154ffdbaf74
+ .quad 0x3e667bdd57fb306a
+ .quad 0x3e67dc255ac40886
+ .quad 0x3df219f38e8afafe
+ .quad 0x3e62416bf9669a04
+ .quad 0x3e611c96b2b3987f
+ .quad 0x3e6f99ed447e1177
+ .quad 0x3e13245826328a11
+ .quad 0x3e66f56dd1e645f8
+ .quad 0x3e46164946945535
+ .quad 0x3e5e37d59d190028
+ .quad 0x3e668671f12bf828
+ .quad 0x3e6e8ecbca6aabbd
+ .quad 0x3e53f49e109a5912
+ .quad 0x3e6b8a0e11ec3043
+ .quad 0x3e65fae00aed691a
+ .quad 0x3e6c0569bece3e4a
+ .quad 0x3e605e26744efbfe
+ .quad 0x3e65b570a94be5c5
+ .quad 0x3e5d6f156ea0e063
+ .quad 0x3e6e0ca7612fc484
+ .quad 0x3e4963c927b25258
+ .quad 0x3e547930aa725a5c
+ .quad 0x3e58a79fe3af43b3
+ .quad 0x3e5e6dc29c41bdaf
+ .quad 0x3e657a2e76f863a5
+ .quad 0x3e2ae3b61716354d
+ .quad 0x3e665fb5df6906b1
+ .quad 0x3e66177d7f588f7b
+ .quad 0x3e3ad55abd091b67
+ .quad 0x3e155337b2422d76
+ .quad 0x3e6084ebe86972d5
+ .quad 0x3e656395808e1ea3
+ .quad 0x3e61bce21b40fba7
+ .quad 0x3e5006f94605b515
+ .quad 0x3e6aa676aceb1f7d
+ .quad 0x3e58229f76554ce6
+ .quad 0x3e6eabfc6cf57330
+ .quad 0x3e64daed9c0ce8bc
+ .quad 0x3e60ff1768237141
+ .quad 0x3e6575f83051b085
+ .quad 0x3e42667deb523e29
+ .quad 0x3e1816996954f4fd
+ .quad 0x3e587cfccf4d9cd4
+ .quad 0x3e52c5d018198353
+ .quad 0x3e6a7a898dcc34aa
+ .quad 0x3e2cead6dadc36d1
+ .quad 0x3e2a55759c498bdf
+ .quad 0x3e6c414a9ef6de04
+ .quad 0x3e63e2108a6e58fa
+ .quad 0x3e5587fd7643d77c
+ .quad 0x3e3901eb1d3ff3df
+ .quad 0x3e6f2ccd7c812fc6
+ .quad 0x3e21c8ee70a01049
+ .quad 0x3e563e8d02831eec
+ .quad 0x3e6f61a42a92c7ff
+ .quad 0x3dda917399c84d24
+ .quad 0x3e5e9197c8eec2f0
+ .quad 0x3e5e6f842f5a1378
+ .quad 0x3e2fac242a90a0fc
+ .quad 0x3e535ed726610227
+ .quad 0x3e50e0d64804b15b
+ .quad 0x3e0560675daba814
+ .quad 0x3e637388c8768032
+ .quad 0x3e3ee3c89f9e01f5
+ .quad 0x3e639f6f0d09747c
+ .quad 0x3e4322c327abb8f0
+ .quad 0x3e6961b347c8ac80
+ .quad 0x3e63711fbbd0f118
+ .quad 0x3e64fad8d7718ffb
+ .quad 0x3e6fffffffffffff
+ .quad 0x3e667efa79ec35b4
+ .quad 0x3e6a737687a254a8
+ .quad 0x3e5bace0f87d924d
+ .quad 0x3e629e37c237e392
+ .quad 0x3e557ce7ac3f3012
+ .quad 0x3e682829359f8fbd
+ .quad 0x3e6cc9be42d14676
+ .quad 0x3e6a8f001c137d0b
+ .quad 0x3e636127687dda05
+ .quad 0x3e524dba322646f0
+ .quad 0x3e6dc43f1ed210b4
+
+.align 32
+.L__INV_TAB_256: .quad 0x4000000000000000
+ .quad 0x3fffe01fe01fe020
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fffa11caa01fa12
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff6310aca0dbb5
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff25f644230ab5
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffee9c7f8458e02
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffeae807aba01eb
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe741aa59750e4
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe3a9179dc1a73
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffe01e01e01e01e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdca01dca01dca
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd92f2231e7f8a
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd5cac807572b2
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd272ca3fc5b1a
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcf26e5c44bfc6
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffcbe6d9601cbe7
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc8b265afb8a42
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc5894d10d4986
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc26b5392ea01c
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbf583ee868d8b
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbc4fd65883e7b
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb951e2b18ff23
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb65e2e3beee05
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb37484ad806ce
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffb094b31d922a4
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffadbe87f94905e
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffaaf1d2f87ebfd
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa82e65130e159
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa574107688a4a
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa2c2a87c51ca0
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ffa01a01a01a01a
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9d79f176b682d
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff9ae24ea5510da
+ .quad 0x3ff999999999999a
+ .quad 0x3ff9852f0d8ec0ff
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff95cbb0be377ae
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff934c67f9b2ce6
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff90d4f120190d5
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8e6527af1373f
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8bfce8062ff3a
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff899c0f601899c
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff87427bcc092b9
+ .quad 0x3ff8618618618618
+ .quad 0x3ff84f00c2780614
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff82a4a0182a4a0
+ .quad 0x3ff8181818181818
+ .quad 0x3ff8060180601806
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7e225515a4f1d
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7beb3922e017c
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff79baa6bb6398b
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff77908119ac60d
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff756cac201756d
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff734f0c541fe8d
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff713786d9c7c09
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6f26016f26017
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6d1a62681c861
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6b1490aa31a3d
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff691473a88d0c0
+ .quad 0x3ff6816816816817
+ .quad 0x3ff6719f3601671a
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff6524f853b4aa3
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff63356b88ac0de
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff614b36831ae94
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5f66434292dfc
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5d867c3ece2a5
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5babcc647fa91
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff59d61f123ccaa
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff5805601580560
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff56397ba7c52e2
+ .quad 0x3ff5555555555555
+ .quad 0x3ff54725e6bb82fe
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff52aff56a8054b
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff50f22e111c4c5
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4f38f62dd4c9b
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4d843bedc2c4c
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4bd3edda68fe1
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff4a27fad76014a
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff4880522014880
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff46dce34596066
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff453d9e2c776ca
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff43a2730abee4d
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff420b5265e5951
+ .quad 0x3ff4141414141414
+ .quad 0x3ff40782d10e6566
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3ee8f42a5af07
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3d5d991aa75c6
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3bd60d9232955
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3a524387ac822
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff38d22d366088e
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3755bd1c945ee
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff35dce5f9f2af8
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff34679ace01346
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff32f5ced6a1dfa
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff3187758e9ebb6
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff301c82ac40260
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2eb4ea1fed14b
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2d50a012d50a0
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2bef98e5a3711
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff2a91c92f3c105
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff293725bb804a5
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27dfa38a1ce4d
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff268b37cd60127
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2539d7e9177b2
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23eb79717605b
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff22a0122a0122a
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff21579804855e6
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff2012012012012
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1ecf43c7fb84c
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1d8f5672e4abd
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1c522fc1ce059
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1b17c67f2bae3
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19e0119e0119e
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff18ab083902bdb
+ .quad 0x3ff1811811811812
+ .quad 0x3ff1778a191bd684
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff1648d50fc3201
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff151b9a3fdd5c9
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff13f0e8d344724
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12c8b89edc0ac
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff11a3019a74826
+ .quad 0x3ff1111111111111
+ .quad 0x3ff107fbbe011080
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0f5edfab325a2
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0e40655826011
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0d24456359e3a
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0c0a7868b4171
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0af2f722eecb5
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff09ddba6af8360
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff08cabb37565e2
+ .quad 0x3ff0842108421084
+ .quad 0x3ff07b9f29b8eae2
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff06ab59c7912fb
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff059eea0727586
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff04949cc1664c5
+ .quad 0x3ff0410410410410
+ .quad 0x3ff038c6b78247fc
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff02864fc7729e9
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0182436517a37
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0080402010080
+ .quad 0x3ff0000000000000
+
diff --git a/src/gas/cbrtf.S b/src/gas/cbrtf.S
new file mode 100644
index 0000000..21bdd0b
--- /dev/null
+++ b/src/gas/cbrtf.S
@@ -0,0 +1,717 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# cbrtf.S
+#
+# An implementation of the cbrtf libm function.
+#
+# Prototype:
+#
+# float cbrtf(float x);
+#
+
+#
+# Algorithm:
+#
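+# (Sketch, reconstructed from the code below; kept here as a reference, not
+#  as normative documentation.)
+#
+#   x = (-1)^s * 2^e * 1.m              /* denormals are first scaled by 2^23 */
+#   e = 3*q + r, with r in {-2,...,2}
+#   cbrtf(x) = (-1)^s * 2^q * cbrt(2^r) * cbrt(1.m)
+#   1.m = F * (1 + u), where 1/F is read from .L__DoubleReciprocalTable_256
+#   (indexed by the top 8 mantissa bits) and cbrt(F) from .L__CubeRootTable_256
+#   cbrt(1 + u) ~= 1 + u/3 - u*u/9      /* coefficients in .L__coefficients */
+#   cbrt(2^r) is read from .L__defined_cuberoot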
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cbrtf)
+#define fname_special _cbrtf_special
+
+
+# local variable storage offsets
+
+.equ store_input, 0x0
+.equ stack_size, 0x20
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 32
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ xor %rcx,%rcx
+ sub $stack_size, %rsp
+ movss %xmm0, store_input(%rsp)
+ movss %xmm0,%xmm1
+ mov store_input(%rsp),%r8
+ mov $0x7F800000,%r10
+ mov $0x007FFFFF,%r11
+ mov %r8,%r9
+	and %r10,%r8	 # r8 holds the exponent bits
+	and %r11,%r9	 # r9 holds the mantissa bits
+ cmp $0X7F800000,%r8
+ jz .L__cbrtf_is_nan_infinite
+ cmp $0X0,%r8
+ jz .L__cbrtf_is_denormal
+.align 32
+.L__cbrtf_is_normal:
+ cvtps2pd %xmm1,%xmm1
+ shr $23,%r8 # exp value
+	mov $3,%rdx	# check whether dx always needs to be set to 3
+ mov %r8,%rax
+ movsd %xmm1,%xmm6
+ shr $15,%r9 # index for the reciprocal
+ sub $0x7F,%ax
+ idiv %dl # Accumulator is divided by dl=3
+ mov %ax,%dx
+	shr $8,%dx	# dx contains the remainder
+ add $2,%dl
+	# ax contains the quotient (the scale factor)
+ cbw # sign extend al to ax
+ add $0x3FF,%ax
+ shl $52,%rax
+ pand .L__mantissa_mask_64(%rip),%xmm1
+ mov %rax,store_input(%rsp)
+ movsd store_input(%rsp),%xmm7
+ movsd .L__sign_mask_64(%rip),%xmm2
+ por .L__one_mask_64(%rip),%xmm1
+ movapd .L__coefficients(%rip),%xmm0
+ pandn %xmm1,%xmm2
+ pand .L__sign_mask_64(%rip),%xmm6 # has the sign
+ lea .L__DoubleReciprocalTable_256(%rip),%r8
+ lea .L__CubeRootTable_256(%rip),%rax
+	movsd (%r8,%r9,8),%xmm3		# reciprocal; table entries are 8-byte doubles
+	movsd (%rax,%r9,8),%xmm4	# cube root
+ mulsd %xmm2,%xmm3
+ subsd .L__one_mask_64(%rip),%xmm3
+
+ # movddup %xmm3,%xmm3
+ shufpd $0,%xmm3,%xmm3 # replacing movddup
+
+ mulsd %xmm3,%xmm3
+ mulpd %xmm3,%xmm0
+#######################################################################
+# haddpd is an SSE3 instruction; using it here gives better performance.
+	#haddpd %xmm0,%xmm0
+# The three instructions below must be commented out and the haddpd above
+# uncommented if SSE3 instructions can be used.
+ movapd %xmm0,%xmm3
+ unpckhpd %xmm3,%xmm3
+ addsd %xmm3,%xmm0
+#######################################################################
+ addsd .L__one_mask_64(%rip),%xmm0
+ mulsd %xmm7,%xmm0
+ lea .L__defined_cuberoot(%rip),%rax
+ mulsd (%rax,%rdx,8),%xmm0
+
+ mulsd %xmm4,%xmm0
+ cmp $1,%cx
+ jnz .L__final_result
+ mulsd .L__denormal_factor(%rip),%xmm0
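+					# (.L__denormal_factor is cbrt(2^-23), which undoes the
+					#  2^23 scaling applied in .L__cbrtf_is_denormal)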
+
+.align 32
+.L__final_result:
+ por %xmm6, %xmm0
+ cvtsd2ss %xmm0,%xmm0
+ add $stack_size, %rsp
+ ret
+
+
+.align 32
+.L__cbrtf_is_denormal:
+ cmp $0,%r9
+ jz .L__cbrtf_is_zero
+ mulss .L__2_pow_23(%rip),%xmm1
+ movss %xmm1, store_input(%rsp)
+ mov $1,%cx
+ mov store_input(%rsp),%r8
+ mov %r8,%r9
+	and %r10,%r8	 # r8 holds the exponent bits
+	and %r11,%r9	 # r9 holds the mantissa bits
+ jmp .L__cbrtf_is_normal
+
+.align 32
+.L__cbrtf_is_nan_infinite:
+ cmp $0,%r9
+ jz .L__cbrtf_is_infinite
+ mulss %xmm0,%xmm0 #this multiplication will raise an invalid exception
+ por .L__qnan_mask_32(%rip),%xmm0
+
+.L__cbrtf_is_infinite:
+.L__cbrtf_is_one:
+.L__cbrtf_is_zero:
+ add $stack_size, %rsp
+ ret
+
+.align 32
+.L__mantissa_mask_32: .long 0x007FFFFF
+ .long 0 #this zero is necessary
+.align 16
+.L__qnan_mask_32: .long 0x00400000
+ .long 0
+.L__exp_mask_32: .long 0x7F800000
+ .long 0
+.L__zero: .long 0x00000000
+ .long 0
+.align 16
+.L__mantissa_mask_64: .quad 0x000FFFFFFFFFFFFF
+.L__2_pow_23: .long 0x4B000000
+
+
+.align 16
+.L__sign_mask_64: .quad 0x8000000000000000
+ .quad 0
+.L__one_mask_64: .quad 0x3FF0000000000000
+ .quad 0
+
+.align 16
+.L__denormal_factor: .quad 0x3F7428A2F98D728B
+ .quad 0
+.align 16
+.L__coefficients:
+ .quad 0xbFBC71C71C71C71C
+ .quad 0x3fd5555555555555
+.align 16
+.L__defined_cuberoot: .quad 0x3FE428A2F98D728B
+ .quad 0x3FE965FEA53D6E3D
+ .quad 0x3FF0000000000000
+ .quad 0x3FF428A2F98D728B
+ .quad 0x3FF965FEA53D6E3D
+
+.align 32
+.L__DoubleReciprocalTable_256: .quad 0X3ff0000000000000
+ .quad 0X3fefe00000000000
+ .quad 0X3fefc00000000000
+ .quad 0X3fefa00000000000
+ .quad 0X3fef800000000000
+ .quad 0X3fef600000000000
+ .quad 0X3fef400000000000
+ .quad 0X3fef200000000000
+ .quad 0X3fef000000000000
+ .quad 0X3feee00000000000
+ .quad 0X3feec00000000000
+ .quad 0X3feea00000000000
+ .quad 0X3fee900000000000
+ .quad 0X3fee700000000000
+ .quad 0X3fee500000000000
+ .quad 0X3fee300000000000
+ .quad 0X3fee100000000000
+ .quad 0X3fee000000000000
+ .quad 0X3fede00000000000
+ .quad 0X3fedc00000000000
+ .quad 0X3feda00000000000
+ .quad 0X3fed900000000000
+ .quad 0X3fed700000000000
+ .quad 0X3fed500000000000
+ .quad 0X3fed400000000000
+ .quad 0X3fed200000000000
+ .quad 0X3fed000000000000
+ .quad 0X3fecf00000000000
+ .quad 0X3fecd00000000000
+ .quad 0X3fecb00000000000
+ .quad 0X3feca00000000000
+ .quad 0X3fec800000000000
+ .quad 0X3fec700000000000
+ .quad 0X3fec500000000000
+ .quad 0X3fec300000000000
+ .quad 0X3fec200000000000
+ .quad 0X3fec000000000000
+ .quad 0X3febf00000000000
+ .quad 0X3febd00000000000
+ .quad 0X3febc00000000000
+ .quad 0X3feba00000000000
+ .quad 0X3feb900000000000
+ .quad 0X3feb700000000000
+ .quad 0X3feb600000000000
+ .quad 0X3feb400000000000
+ .quad 0X3feb300000000000
+ .quad 0X3feb200000000000
+ .quad 0X3feb000000000000
+ .quad 0X3feaf00000000000
+ .quad 0X3fead00000000000
+ .quad 0X3feac00000000000
+ .quad 0X3feaa00000000000
+ .quad 0X3fea900000000000
+ .quad 0X3fea800000000000
+ .quad 0X3fea600000000000
+ .quad 0X3fea500000000000
+ .quad 0X3fea400000000000
+ .quad 0X3fea200000000000
+ .quad 0X3fea100000000000
+ .quad 0X3fea000000000000
+ .quad 0X3fe9e00000000000
+ .quad 0X3fe9d00000000000
+ .quad 0X3fe9c00000000000
+ .quad 0X3fe9a00000000000
+ .quad 0X3fe9900000000000
+ .quad 0X3fe9800000000000
+ .quad 0X3fe9700000000000
+ .quad 0X3fe9500000000000
+ .quad 0X3fe9400000000000
+ .quad 0X3fe9300000000000
+ .quad 0X3fe9200000000000
+ .quad 0X3fe9000000000000
+ .quad 0X3fe8f00000000000
+ .quad 0X3fe8e00000000000
+ .quad 0X3fe8d00000000000
+ .quad 0X3fe8b00000000000
+ .quad 0X3fe8a00000000000
+ .quad 0X3fe8900000000000
+ .quad 0X3fe8800000000000
+ .quad 0X3fe8700000000000
+ .quad 0X3fe8600000000000
+ .quad 0X3fe8400000000000
+ .quad 0X3fe8300000000000
+ .quad 0X3fe8200000000000
+ .quad 0X3fe8100000000000
+ .quad 0X3fe8000000000000
+ .quad 0X3fe7f00000000000
+ .quad 0X3fe7e00000000000
+ .quad 0X3fe7d00000000000
+ .quad 0X3fe7b00000000000
+ .quad 0X3fe7a00000000000
+ .quad 0X3fe7900000000000
+ .quad 0X3fe7800000000000
+ .quad 0X3fe7700000000000
+ .quad 0X3fe7600000000000
+ .quad 0X3fe7500000000000
+ .quad 0X3fe7400000000000
+ .quad 0X3fe7300000000000
+ .quad 0X3fe7200000000000
+ .quad 0X3fe7100000000000
+ .quad 0X3fe7000000000000
+ .quad 0X3fe6f00000000000
+ .quad 0X3fe6e00000000000
+ .quad 0X3fe6d00000000000
+ .quad 0X3fe6c00000000000
+ .quad 0X3fe6b00000000000
+ .quad 0X3fe6a00000000000
+ .quad 0X3fe6900000000000
+ .quad 0X3fe6800000000000
+ .quad 0X3fe6700000000000
+ .quad 0X3fe6600000000000
+ .quad 0X3fe6500000000000
+ .quad 0X3fe6400000000000
+ .quad 0X3fe6300000000000
+ .quad 0X3fe6200000000000
+ .quad 0X3fe6100000000000
+ .quad 0X3fe6000000000000
+ .quad 0X3fe5f00000000000
+ .quad 0X3fe5e00000000000
+ .quad 0X3fe5d00000000000
+ .quad 0X3fe5c00000000000
+ .quad 0X3fe5b00000000000
+ .quad 0X3fe5a00000000000
+ .quad 0X3fe5900000000000
+ .quad 0X3fe5800000000000
+ .quad 0X3fe5800000000000
+ .quad 0X3fe5700000000000
+ .quad 0X3fe5600000000000
+ .quad 0X3fe5500000000000
+ .quad 0X3fe5400000000000
+ .quad 0X3fe5300000000000
+ .quad 0X3fe5200000000000
+ .quad 0X3fe5100000000000
+ .quad 0X3fe5000000000000
+ .quad 0X3fe5000000000000
+ .quad 0X3fe4f00000000000
+ .quad 0X3fe4e00000000000
+ .quad 0X3fe4d00000000000
+ .quad 0X3fe4c00000000000
+ .quad 0X3fe4b00000000000
+ .quad 0X3fe4a00000000000
+ .quad 0X3fe4a00000000000
+ .quad 0X3fe4900000000000
+ .quad 0X3fe4800000000000
+ .quad 0X3fe4700000000000
+ .quad 0X3fe4600000000000
+ .quad 0X3fe4600000000000
+ .quad 0X3fe4500000000000
+ .quad 0X3fe4400000000000
+ .quad 0X3fe4300000000000
+ .quad 0X3fe4200000000000
+ .quad 0X3fe4200000000000
+ .quad 0X3fe4100000000000
+ .quad 0X3fe4000000000000
+ .quad 0X3fe3f00000000000
+ .quad 0X3fe3e00000000000
+ .quad 0X3fe3e00000000000
+ .quad 0X3fe3d00000000000
+ .quad 0X3fe3c00000000000
+ .quad 0X3fe3b00000000000
+ .quad 0X3fe3b00000000000
+ .quad 0X3fe3a00000000000
+ .quad 0X3fe3900000000000
+ .quad 0X3fe3800000000000
+ .quad 0X3fe3800000000000
+ .quad 0X3fe3700000000000
+ .quad 0X3fe3600000000000
+ .quad 0X3fe3500000000000
+ .quad 0X3fe3500000000000
+ .quad 0X3fe3400000000000
+ .quad 0X3fe3300000000000
+ .quad 0X3fe3200000000000
+ .quad 0X3fe3200000000000
+ .quad 0X3fe3100000000000
+ .quad 0X3fe3000000000000
+ .quad 0X3fe3000000000000
+ .quad 0X3fe2f00000000000
+ .quad 0X3fe2e00000000000
+ .quad 0X3fe2e00000000000
+ .quad 0X3fe2d00000000000
+ .quad 0X3fe2c00000000000
+ .quad 0X3fe2b00000000000
+ .quad 0X3fe2b00000000000
+ .quad 0X3fe2a00000000000
+ .quad 0X3fe2900000000000
+ .quad 0X3fe2900000000000
+ .quad 0X3fe2800000000000
+ .quad 0X3fe2700000000000
+ .quad 0X3fe2700000000000
+ .quad 0X3fe2600000000000
+ .quad 0X3fe2500000000000
+ .quad 0X3fe2500000000000
+ .quad 0X3fe2400000000000
+ .quad 0X3fe2300000000000
+ .quad 0X3fe2300000000000
+ .quad 0X3fe2200000000000
+ .quad 0X3fe2100000000000
+ .quad 0X3fe2100000000000
+ .quad 0X3fe2000000000000
+ .quad 0X3fe2000000000000
+ .quad 0X3fe1f00000000000
+ .quad 0X3fe1e00000000000
+ .quad 0X3fe1e00000000000
+ .quad 0X3fe1d00000000000
+ .quad 0X3fe1c00000000000
+ .quad 0X3fe1c00000000000
+ .quad 0X3fe1b00000000000
+ .quad 0X3fe1b00000000000
+ .quad 0X3fe1a00000000000
+ .quad 0X3fe1900000000000
+ .quad 0X3fe1900000000000
+ .quad 0X3fe1800000000000
+ .quad 0X3fe1800000000000
+ .quad 0X3fe1700000000000
+ .quad 0X3fe1600000000000
+ .quad 0X3fe1600000000000
+ .quad 0X3fe1500000000000
+ .quad 0X3fe1500000000000
+ .quad 0X3fe1400000000000
+ .quad 0X3fe1300000000000
+ .quad 0X3fe1300000000000
+ .quad 0X3fe1200000000000
+ .quad 0X3fe1200000000000
+ .quad 0X3fe1100000000000
+ .quad 0X3fe1100000000000
+ .quad 0X3fe1000000000000
+ .quad 0X3fe0f00000000000
+ .quad 0X3fe0f00000000000
+ .quad 0X3fe0e00000000000
+ .quad 0X3fe0e00000000000
+ .quad 0X3fe0d00000000000
+ .quad 0X3fe0d00000000000
+ .quad 0X3fe0c00000000000
+ .quad 0X3fe0c00000000000
+ .quad 0X3fe0b00000000000
+ .quad 0X3fe0a00000000000
+ .quad 0X3fe0a00000000000
+ .quad 0X3fe0900000000000
+ .quad 0X3fe0900000000000
+ .quad 0X3fe0800000000000
+ .quad 0X3fe0800000000000
+ .quad 0X3fe0700000000000
+ .quad 0X3fe0700000000000
+ .quad 0X3fe0600000000000
+ .quad 0X3fe0600000000000
+ .quad 0X3fe0500000000000
+ .quad 0X3fe0500000000000
+ .quad 0X3fe0400000000000
+ .quad 0X3fe0400000000000
+ .quad 0X3fe0300000000000
+ .quad 0X3fe0300000000000
+ .quad 0X3fe0200000000000
+ .quad 0X3fe0200000000000
+ .quad 0X3fe0100000000000
+ .quad 0X3fe0100000000000
+ .quad 0X3fe0000000000000
+
+.align 32
+.L__CubeRootTable_256: .quad 0X3ff0000000000000
+ .quad 0X3ff00558e6547c36
+ .quad 0X3ff00ab8f9d2f374
+ .quad 0X3ff010204b673fc7
+ .quad 0X3ff0158eec36749b
+ .quad 0X3ff01b04ed9fdb53
+ .quad 0X3ff02082613df53c
+ .quad 0X3ff0260758e78308
+ .quad 0X3ff02b93e6b091f0
+ .quad 0X3ff031281ceb8ea2
+ .quad 0X3ff036c40e2a5e2a
+ .quad 0X3ff03c67cd3f7cea
+ .quad 0X3ff03f3c9fee224c
+ .quad 0X3ff044ec379f7f79
+ .quad 0X3ff04aa3cd578d67
+ .quad 0X3ff0506374d40a3d
+ .quad 0X3ff0562b4218a6e3
+ .quad 0X3ff059123d3a9848
+ .quad 0X3ff05ee6694e7166
+ .quad 0X3ff064c2ee6e07c6
+ .quad 0X3ff06aa7e19c01c5
+ .quad 0X3ff06d9d8b1decca
+ .quad 0X3ff0738f4b6cc8e2
+ .quad 0X3ff07989af9f9f59
+ .quad 0X3ff07c8a2611201c
+ .quad 0X3ff08291a9958f03
+ .quad 0X3ff088a208c3fe28
+ .quad 0X3ff08bad91dd7d8b
+ .quad 0X3ff091cb6588465e
+ .quad 0X3ff097f24eab04a1
+ .quad 0X3ff09b0932aee3f2
+ .quad 0X3ff0a13de8970de4
+ .quad 0X3ff0a45bc08a5ac7
+ .quad 0X3ff0aa9e79bfa986
+ .quad 0X3ff0b0eaa961ca5b
+ .quad 0X3ff0b4145573271c
+ .quad 0X3ff0ba6ee5f9aad4
+ .quad 0X3ff0bd9fd0dbe02d
+ .quad 0X3ff0c408fc1cfd4b
+ .quad 0X3ff0c741430e2059
+ .quad 0X3ff0cdb9442ea813
+ .quad 0X3ff0d0f905168e6c
+ .quad 0X3ff0d7801893d261
+ .quad 0X3ff0dac772091bde
+ .quad 0X3ff0e15dd5c330ab
+ .quad 0X3ff0e4ace71080a4
+ .quad 0X3ff0e7fe920f3037
+ .quad 0X3ff0eea9c37e497e
+ .quad 0X3ff0f203512f4314
+ .quad 0X3ff0f8be68db7f32
+ .quad 0X3ff0fc1ffa42d902
+ .quad 0X3ff102eb3af9ed89
+ .quad 0X3ff10654f1e29cfb
+ .quad 0X3ff109c1679c189f
+ .quad 0X3ff110a29f080b3d
+ .quad 0X3ff114176891738a
+ .quad 0X3ff1178f0099b429
+ .quad 0X3ff11e86ac2cd7ab
+ .quad 0X3ff12206c7cf4046
+ .quad 0X3ff12589c21fb842
+ .quad 0X3ff12c986355d0d2
+ .quad 0X3ff13024129645cf
+ .quad 0X3ff133b2b13aa0eb
+ .quad 0X3ff13ad8cdc48ba3
+ .quad 0X3ff13e70544b1d4f
+ .quad 0X3ff1420adb77c99a
+ .quad 0X3ff145a867b1bfea
+ .quad 0X3ff14ceca1189d6d
+ .quad 0X3ff15093574284e9
+ .quad 0X3ff1543d2473ea9b
+ .quad 0X3ff157ea0d433a46
+ .quad 0X3ff15f4d44462724
+ .quad 0X3ff163039bd7cde6
+ .quad 0X3ff166bd21c3a8e2
+ .quad 0X3ff16a79dad1fb59
+ .quad 0X3ff171fcf9aaac3d
+ .quad 0X3ff175c3693980c3
+ .quad 0X3ff1798d1f73f3ef
+ .quad 0X3ff17d5a2156e97f
+ .quad 0X3ff1812a73ea2593
+ .quad 0X3ff184fe1c406b8f
+ .quad 0X3ff18caf82b8dba4
+ .quad 0X3ff1908d4b38a510
+ .quad 0X3ff1946e7e36f7e5
+ .quad 0X3ff1985320ff72a2
+ .quad 0X3ff19c3b38e975a8
+ .quad 0X3ff1a026cb58453d
+ .quad 0X3ff1a415ddbb2c10
+ .quad 0X3ff1a808758d9e32
+ .quad 0X3ff1aff84bac98ea
+ .quad 0X3ff1b3f5952e1a50
+ .quad 0X3ff1b7f67a896220
+ .quad 0X3ff1bbfb0178d186
+ .quad 0X3ff1c0032fc3cf91
+ .quad 0X3ff1c40f0b3eefc4
+ .quad 0X3ff1c81e99cc193f
+ .quad 0X3ff1cc31e15aae72
+ .quad 0X3ff1d048e7e7b565
+ .quad 0X3ff1d463b37e0090
+ .quad 0X3ff1d8824a365852
+ .quad 0X3ff1dca4b237a4f7
+ .quad 0X3ff1e0caf1b71965
+ .quad 0X3ff1e4f50ef85e61
+ .quad 0X3ff1e923104dbe76
+ .quad 0X3ff1ed54fc185286
+ .quad 0X3ff1f18ad8c82efc
+ .quad 0X3ff1f5c4acdc91aa
+ .quad 0X3ff1fa027ee4105b
+ .quad 0X3ff1fe44557cc808
+ .quad 0X3ff2028a37548ccf
+ .quad 0X3ff206d42b291a95
+ .quad 0X3ff20b2237c8466a
+ .quad 0X3ff20f74641030a6
+ .quad 0X3ff213cab6ef77c7
+ .quad 0X3ff2182537656c13
+ .quad 0X3ff21c83ec824406
+ .quad 0X3ff220e6dd675180
+ .quad 0X3ff2254e114737d2
+ .quad 0X3ff229b98f66228c
+ .quad 0X3ff22e295f19fd31
+ .quad 0X3ff2329d87caabb6
+ .quad 0X3ff2371610f243f2
+ .quad 0X3ff23b93021d47da
+ .quad 0X3ff2401462eae0b8
+ .quad 0X3ff2449a3b0d1b3f
+ .quad 0X3ff2449a3b0d1b3f
+ .quad 0X3ff2492492492492
+ .quad 0X3ff24db370778844
+ .quad 0X3ff25246dd846f45
+ .quad 0X3ff256dee16fdfd4
+ .quad 0X3ff25b7b844dfe71
+ .quad 0X3ff2601cce474fd2
+ .quad 0X3ff264c2c798fbe5
+ .quad 0X3ff2696d789511e2
+ .quad 0X3ff2696d789511e2
+ .quad 0X3ff26e1ce9a2cd73
+ .quad 0X3ff272d1233edcf3
+ .quad 0X3ff2778a2dfba8d0
+ .quad 0X3ff27c4812819c13
+ .quad 0X3ff2810ad98f6e10
+ .quad 0X3ff285d28bfa6d45
+ .quad 0X3ff285d28bfa6d45
+ .quad 0X3ff28a9f32aecb79
+ .quad 0X3ff28f70d6afeb08
+ .quad 0X3ff294478118ad83
+ .quad 0X3ff299233b1bc38a
+ .quad 0X3ff299233b1bc38a
+ .quad 0X3ff29e040e03fdfb
+ .quad 0X3ff2a2ea0334a07b
+ .quad 0X3ff2a7d52429b556
+ .quad 0X3ff2acc57a7862c2
+ .quad 0X3ff2acc57a7862c2
+ .quad 0X3ff2b1bb0fcf4190
+ .quad 0X3ff2b6b5edf6b54a
+ .quad 0X3ff2bbb61ed145cf
+ .quad 0X3ff2c0bbac5bfa6e
+ .quad 0X3ff2c0bbac5bfa6e
+ .quad 0X3ff2c5c6a0aeb681
+ .quad 0X3ff2cad705fc97a6
+ .quad 0X3ff2cfece6945583
+ .quad 0X3ff2cfece6945583
+ .quad 0X3ff2d5084ce0a331
+ .quad 0X3ff2da294368924f
+ .quad 0X3ff2df4fd4cff7c3
+ .quad 0X3ff2df4fd4cff7c3
+ .quad 0X3ff2e47c0bd7d237
+ .quad 0X3ff2e9adf35eb25a
+ .quad 0X3ff2eee5966124e8
+ .quad 0X3ff2eee5966124e8
+ .quad 0X3ff2f422fffa1e92
+ .quad 0X3ff2f9663b6369b6
+ .quad 0X3ff2feaf53f61612
+ .quad 0X3ff2feaf53f61612
+ .quad 0X3ff303fe552aea57
+ .quad 0X3ff309534a9ad7ce
+ .quad 0X3ff309534a9ad7ce
+ .quad 0X3ff30eae3fff6ff3
+ .quad 0X3ff3140f41335c2f
+ .quad 0X3ff3140f41335c2f
+ .quad 0X3ff319765a32d7ae
+ .quad 0X3ff31ee3971c2b5b
+ .quad 0X3ff3245704302c13
+ .quad 0X3ff3245704302c13
+ .quad 0X3ff329d0add2bb20
+ .quad 0X3ff32f50a08b48f9
+ .quad 0X3ff32f50a08b48f9
+ .quad 0X3ff334d6e9055a5f
+ .quad 0X3ff33a6394110fe6
+ .quad 0X3ff33a6394110fe6
+ .quad 0X3ff33ff6aea3afed
+ .quad 0X3ff3459045d8331b
+ .quad 0X3ff3459045d8331b
+ .quad 0X3ff34b3066efd36b
+ .quad 0X3ff350d71f529dd8
+ .quad 0X3ff350d71f529dd8
+ .quad 0X3ff356847c9006b4
+ .quad 0X3ff35c388c5f80bf
+ .quad 0X3ff35c388c5f80bf
+ .quad 0X3ff361f35ca116ff
+ .quad 0X3ff361f35ca116ff
+ .quad 0X3ff367b4fb5e0985
+ .quad 0X3ff36d7d76c96d0a
+ .quad 0X3ff36d7d76c96d0a
+ .quad 0X3ff3734cdd40cd95
+ .quad 0X3ff379233d4cd42a
+ .quad 0X3ff379233d4cd42a
+ .quad 0X3ff37f00a5a1ef96
+ .quad 0X3ff37f00a5a1ef96
+ .quad 0X3ff384e52521006c
+ .quad 0X3ff38ad0cad80848
+ .quad 0X3ff38ad0cad80848
+ .quad 0X3ff390c3a602dc60
+ .quad 0X3ff390c3a602dc60
+ .quad 0X3ff396bdc60bdb88
+ .quad 0X3ff39cbf3a8ca7a9
+ .quad 0X3ff39cbf3a8ca7a9
+ .quad 0X3ff3a2c8134ee2d1
+ .quad 0X3ff3a2c8134ee2d1
+ .quad 0X3ff3a8d8604cefe3
+ .quad 0X3ff3aef031b2b706
+ .quad 0X3ff3aef031b2b706
+ .quad 0X3ff3b50f97de6de5
+ .quad 0X3ff3b50f97de6de5
+ .quad 0X3ff3bb36a36163d8
+ .quad 0X3ff3bb36a36163d8
+ .quad 0X3ff3c1656500d20a
+ .quad 0X3ff3c79bedb6afb8
+ .quad 0X3ff3c79bedb6afb8
+ .quad 0X3ff3cdda4eb28aa2
+ .quad 0X3ff3cdda4eb28aa2
+ .quad 0X3ff3d420995a63c0
+ .quad 0X3ff3d420995a63c0
+ .quad 0X3ff3da6edf4b9061
+ .quad 0X3ff3da6edf4b9061
+ .quad 0X3ff3e0c5325b9fc2
+ .quad 0X3ff3e723a499453f
+ .quad 0X3ff3e723a499453f
+ .quad 0X3ff3ed8a484d473a
+ .quad 0X3ff3ed8a484d473a
+ .quad 0X3ff3f3f92ffb72d8
+ .quad 0X3ff3f3f92ffb72d8
+ .quad 0X3ff3fa706e6394a4
+ .quad 0X3ff3fa706e6394a4
+ .quad 0X3ff400f01682764a
+ .quad 0X3ff400f01682764a
+ .quad 0X3ff407783b92e17a
+ .quad 0X3ff407783b92e17a
+ .quad 0X3ff40e08f10ea81a
+ .quad 0X3ff40e08f10ea81a
+ .quad 0X3ff414a24aafb1e6
+ .quad 0X3ff414a24aafb1e6
+ .quad 0X3ff41b445c710fa7
+ .quad 0X3ff41b445c710fa7
+ .quad 0X3ff421ef3a901411
+ .quad 0X3ff421ef3a901411
+ .quad 0X3ff428a2f98d728b
+
+
+
+
+
+
diff --git a/src/gas/copysign.S b/src/gas/copysign.S
new file mode 100644
index 0000000..d5b96cf
--- /dev/null
+++ b/src/gas/copysign.S
@@ -0,0 +1,63 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#copysign.S
+#
+# An implementation of the copysign libm function.
+#
+# The copysign functions produce a value with the magnitude of x and the sign of y.
+# They produce a NaN (with the sign of y) if x is a NaN. On implementations that
+# represent a signed zero but do not treat negative zero consistently in arithmetic
+# operations, the copysign functions regard the sign of zero as positive.
+#
+#
+# Prototype:
+#
+#	double copysign(double x, double y)
+#
+#
+#
+# Algorithm:
+#
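+# (Sketch of the bit manipulation done below; bits64/from_bits64 are
+#  illustrative helpers, not functions in this library.)
+#
+#   result = from_bits64( (bits64(x) & 0x7fffffffffffffff)
+#                       | (bits64(y) & 0x8000000000000000) );
+#
+#  The code realizes the two masks with PSLLQ/PSRLQ shift pairs and combines
+#  them with POR.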
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(copysign)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ PSLLQ $1,%xmm0
+ PSRLQ $1,%xmm0
+ PSRLQ $63,%xmm1
+ PSLLQ $63,%xmm1
+ POR %xmm1,%xmm0
+
+ ret
diff --git a/src/gas/copysignf.S b/src/gas/copysignf.S
new file mode 100644
index 0000000..90e63d6
--- /dev/null
+++ b/src/gas/copysignf.S
@@ -0,0 +1,70 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#copysignf.S
+#
+# An implementation of the copysignf libm function.
+#
+# The copysign functions produce a value with the magnitude of x and the sign of y.
+# They produce a NaN (with the sign of y) if x is a NaN. On implementations that
+# represent a signed zero but do not treat negative zero consistently in arithmetic
+# operations, the copysign functions regard the sign of zero as positive.
+#
+# Prototype:
+#
+#	float copysignf(float x, float y)
+#
+
+#
+# Algorithm:
+#
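+# (Sketch, analogous to copysign but on 32-bit floats; bits32/from_bits32 are
+#  illustrative helpers, not functions in this library.)
+#
+#   result = from_bits32( (bits32(x) & 0x7fffffff) | (bits32(y) & 0x80000000) );
+#
+#  The code below uses PSLLD/PSRLD shift pairs and POR for the same effect.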
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(copysignf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ #PANDN .L__fabsf_and_mask, %xmm1
+ #POR %xmm1,%xmm0
+
+ PSLLD $1,%xmm0
+ PSRLD $1,%xmm0
+ PSRLD $31,%xmm1
+ PSLLD $31,%xmm1
+ POR %xmm1,%xmm0
+
+ ret
+
+#.align 16
+#.L__sign_mask: .long 0x7FFFFFFF
+#	 .long 0x0
+#	 .quad 0x0
+
diff --git a/src/gas/cos.S b/src/gas/cos.S
new file mode 100644
index 0000000..dc227e0
--- /dev/null
+++ b/src/gas/cos.S
@@ -0,0 +1,485 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the cos function.
+#
+# Prototype:
+#
+# double cos(double x);
+#
+# Computes cos(x).
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+#
+#
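+# (Outline, reconstructed from the code below, of the main path:)
+#
+#   npi2   = (int)(x * 2/pi + 0.5);        /* nearest multiple of pi/2 */
+#   rhead  = x - npi2 * piby2_1;           /* pi/2 split across piby2_1,   */
+#   rtail  = npi2 * piby2_1tail;           /* piby2_1tail, piby2_2, ...    */
+#   r      = rhead - rtail;                /* (Cody-Waite style reduction) */
+#   region = npi2 & 3;
+#   cos(x) = +/- (sin or cos polynomial evaluated at r), chosen by region.
+#
+#  Very large arguments are reduced with __amd_remainder_piby2 instead, and
+#  tiny arguments short-circuit to 1.0 or 1.0 - x*x*0.5.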
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0 # for alignment
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0
+.L__real_411E848000000000:	.quad 0x415312d000000000	# 5e6 (label kept from the earlier value 0x0411E848000000000 = 5e5)
+ .quad 0
+.L__real_bfe0000000000000: .quad 0x0bfe0000000000000 # - 0.5
+ .quad 0
+
+.align 32
+.Lcosarray:
+ .quad 0x3fa5555555555555 # 0.0416667 c1
+ .quad 0
+ .quad 0xbf56c16c16c16967 # -0.00138889 c2
+ .quad 0
+ .quad 0x3EFA01A019F4EC91 # 2.48016e-005 c3
+ .quad 0
+ .quad 0xbE927E4FA17F667B # -2.75573e-007 c4
+ .quad 0
+ .quad 0x3E21EEB690382EEC # 2.08761e-009 c5
+ .quad 0
+ .quad 0xbDA907DB47258AA7 # -1.13826e-011 c6
+ .quad 0
+
+.align 32
+.Lsinarray:
+ .quad 0xbfc5555555555555 # -0.166667 s1
+ .quad 0
+ .quad 0x3f81111111110bb3 # 0.00833333 s2
+ .quad 0
+ .quad 0xbf2a01a019e83e5c # -0.000198413 s3
+ .quad 0
+ .quad 0x3ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0
+ .quad 0xbe5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0
+ .quad 0x3de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0
+
+.text
+.align 32
+.p2align 5,,31
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cos)
+#define fname_special _cos_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2
+.equ rr, 0x60 # pointer to rr for amd_remainder_piby2
+.equ region, 0x70 # pointer to region for amd_remainder_piby2
+.equ stack_size, 0x98
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+ xorpd %xmm2, %xmm2 # zeroed out for later use
+
+# GET_BITS_DP64(x, ux);
+# get the input value to an integer register.
+ movsd %xmm0,p_temp(%rsp)
+ mov p_temp(%rsp), %rdx # rdx is ux
+
+## if NaN or inf
+ mov $0x07ff0000000000000, %rax
+ mov %rax, %r10
+ and %rdx, %r10
+ cmp %rax, %r10
+ jz .Lcos_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff, %r10
+ and %rdx, %r10 # r10 is ax
+ mov $1, %r8d # for determining region later on
+
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18, %rax
+ cmp %rax, %r10
+ jg .Lcos_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x03f20000000000000, %rax
+ cmp %rax, %r10
+ jge .Lcos_small
+
+## if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */
+ mov $0x03e40000000000000, %rax
+ cmp %rax, %r10
+ jge .Lcos_smaller
+
+# cos = 1.0;
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # return a 1
+ jmp .Lcos_cleanup
+
+## else
+.align 16
+.Lcos_smaller:
+# cos = 1.0 - x*x*0.5;
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 # x^2
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 * x^2
+ subsd %xmm2, %xmm0
+ jmp .Lcos_cleanup
+
+## else
+
+.align 16
+.Lcos_small:
+# cos = cos_piby4(x, 0.0);
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a cos calculation
+# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6))));
+
+ movsd .Lcosarray+0x10(%rip), %xmm1 # c2
+ movsd %xmm2, %xmm4 # move for x4
+ mulsd %xmm2, %xmm4 # x4
+ movsd .Lcosarray+0x30(%rip), %xmm3 # c4
+ mulsd %xmm2, %xmm1 # c2x2
+ movsd .Lcosarray+0x50(%rip), %xmm5 # c6
+ mulsd %xmm2, %xmm3 # c4x2
+ movsd %xmm4, %xmm0 # move for x8
+ mulsd %xmm2, %xmm5 # c6x2
+ mulsd %xmm4, %xmm0 # x8
+ addsd .Lcosarray(%rip), %xmm1 # c1 + c2x2
+ mulsd %xmm4, %xmm1 # c1x4 + c2x6
+ addsd .Lcosarray+0x20(%rip), %xmm3 # c3 + c4x2
+ mulsd .L__real_bfe0000000000000(%rip), %xmm2 # -0.5x2, destroy xmm2
+ addsd .Lcosarray+0x40(%rip), %xmm5 # c5 + c6x2
+ mulsd %xmm0, %xmm3 # c3x8 + c4x10
+ mulsd %xmm0, %xmm4 # x12
+ mulsd %xmm5, %xmm4 # c5x12 + c6x14
+
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1
+ addsd %xmm3, %xmm1 # c1x4 + c2x6 + c3x8 + c4x10
+ movsd %xmm2, %xmm3 # preserve -0.5x2
+ addsd %xmm0, %xmm2 # t = 1 - 0.5x2
+ subsd %xmm2, %xmm0 # 1-t
+ addsd %xmm3, %xmm0 # (1-t) - r
+ addsd %xmm4, %xmm1 # c1x4 + c2x6 + c3x8 + c4x10 + c5x12 + c6x14
+ addsd %xmm1, %xmm0 # (1-t) - r + c1x4 + c2x6 + c3x8 + c4x10 + c5x12 + c6x14
+ addsd %xmm2, %xmm0 # 1 - 0.5x2 + above
+
+ jmp .Lcos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcos_reduce:
+
+# xneg = (ax != ux);
+ cmp %r10, %rdx
+
+## if (xneg) x = -x;
+ jz .Lpositive
+ subsd %xmm0, %xmm2
+ movsd %xmm2, %xmm0
+
+.Lpositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip), %r10
+ jae .Lcos_reduce_precise
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fe45f306dc9c883(%rip), %xmm3 # twobypi
+ movsd %xmm0, %xmm4
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+ mulsd %xmm3, %xmm2
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10, %r9
+ shr $52, %r9 # >>EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5, %xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1
+ cvttpd2dq %xmm2, %xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail
+	cvtdq2pd %xmm0, %xmm2			# and back to double
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+ mulsd %xmm2, %xmm3
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2, %xmm1
+ movd %xmm0, %eax
+
+# GET_BITS_DP64(rhead-rtail, uy);
+ movsd %xmm4, %xmm0
+ subsd %xmm1, %xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2
+ movsd %xmm0,p_temp(%rsp)
+ movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail
+ mov p_temp(%rsp), %rcx # rcx is rhead-rtail
+
+# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1, %rcx # strip any sign bit
+ shr $53, %rcx # >> EXPSHIFTBITS_DP64 +1
+ sub %rcx, %r9 # expdiff
+
+## if (expdiff > 15)
+ cmp $15, %r9
+ jle .Lexpdiffless15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4, %xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2, %xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2, %xmm5 # npi2 * piby2_2tail
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4, %xmm1 # t - rhead
+ subsd %xmm3, %xmm1 # -rtail
+ subsd %xmm1, %xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4, %xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5, %xmm1
+ subsd %xmm5, %xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiffless15:
+# region = npi2 & 3;
+
+ subsd %xmm0, %xmm4 # rhead-r
+ subsd %xmm1, %xmm4 # rr = (rhead-r) - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+## if the input was close to a pi/2 multiple
+# The original NAG code missed this trick. If the input is very close to n*pi/2
+# after reduction, then the cos is ~1.0 to within 53 bits when r is < 2^-27.
+# We already have x at this point, so we can skip the cos polynomials.
+
+ cmp $0x03f2, %rcx # if r small.
+ jge .Lcos_piby4 # use taylor series if not
+ cmp $0x03de, %rcx # if r really small.
+ jle .Lr_small # then cos(r) = 1
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 # x^2
+
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jz .Lsinsmall
+
+# region 1 or 3
+# use simply polynomial
+# *s = x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip), %xmm3
+ mulsd %xmm0, %xmm3 # * x
+ mulsd %xmm2, %xmm3 # * x^2
+ subsd %xmm3, %xmm0 # xs
+ jmp .Ladjust_region
+
+.align 16
+.Lsinsmall:
+# region 0 or 2
+# cos = 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2
+ subsd %xmm2, %xmm0
+ jmp .Ladjust_region
+
+.align 16
+.Lr_small:
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jnz .Ladjust_region
+
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # cos(r) is a 1
+ jmp .Ladjust_region
+
+.align 32
+.Lcos_reduce_precise:
+# // Reduce x into range [-pi/4,pi/4]
+# __amd_remainder_piby2(x, &r, &rr, ®ion);
+
+ lea region(%rsp), %rdx
+ lea rr(%rsp), %rsi
+ lea r(%rsp), %rdi
+
+ call __amd_remainder_piby2@PLT
+
+ mov $1, %r8d # for determining region later on
+ movsd r(%rsp), %xmm0 # x
+ movsd rr(%rsp), %xmm4 # xx
+ mov region(%rsp), %eax # region
+
+# xmm0 = x, xmm4 = xx, r8d = 1, eax= region
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 32
+# perform taylor series to calc sinx, cosx
+.Lcos_piby4:
+# x2 = r * r;
+
+# xmm4 holds part of rr for the sin path but is overwritten in the cos path,
+# so xmm3 is used here instead; xmm3 itself is overwritten in the sin path.
+ movsd %xmm0, %xmm3
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jz .Lcospiby4
+
+# region 1 or 3
+ movsd .Lsinarray+0x50(%rip), %xmm3 # s6
+ mulsd %xmm2, %xmm3 # x2s6
+ movsd .Lsinarray+0x20(%rip), %xmm5 # s3
+ movsd %xmm4,p_temp(%rsp) # store xx
+ movsd %xmm2, %xmm1 # move for x4
+ mulsd %xmm2, %xmm1 # x4
+ movsd %xmm0,p_temp1(%rsp) # store x
+ mulsd %xmm2, %xmm5 # x2s3
+ movsd %xmm0, %xmm4 # move for x3
+ addsd .Lsinarray+0x40(%rip), %xmm3 # s5+x2s6
+ mulsd %xmm2, %xmm1 # x6
+ mulsd %xmm2, %xmm3 # x2(s5+x2s6)
+ mulsd %xmm2, %xmm4 # x3
+ addsd .Lsinarray+0x10(%rip), %xmm5 # s2+x2s3
+ mulsd %xmm2, %xmm5 # x2(s2+x2s3)
+ addsd .Lsinarray+0x30(%rip), %xmm3 # s4 + x2(s5+x2s6)
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x2
+ movsd p_temp(%rsp), %xmm0 # load xx
+ mulsd %xmm1, %xmm3 # x6(s4 + x2(s5+x2s6))
+ addsd .Lsinarray(%rip), %xmm5 # s1+x2(s2+x2s3)
+ mulsd %xmm0, %xmm2 # 0.5 * x2 *xx
+ addsd %xmm5, %xmm3 # zs
+ mulsd %xmm3, %xmm4 # *x3
+ subsd %xmm2, %xmm4 # x3*zs - 0.5 * x2 *xx
+ addsd %xmm4, %xmm0 # +xx
+ addsd p_temp1(%rsp), %xmm0 # +x
+
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcospiby4:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a cos calculation
+# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6))));
+ mulsd %xmm0, %xmm4 # x*xx
+ movsd .L__real_3fe0000000000000(%rip), %xmm5
+ movsd .Lcosarray+0x50(%rip), %xmm1 # c6
+ movsd .Lcosarray+0x20(%rip), %xmm0 # c3
+ mulsd %xmm2, %xmm5 # r = 0.5 *x2
+ movsd %xmm2, %xmm3 # copy of x2
+ movsd %xmm4,p_temp(%rsp) # store x*xx
+ mulsd %xmm2, %xmm1 # c6*x2
+ mulsd %xmm2, %xmm0 # c3*x2
+ subsd .L__real_3ff0000000000000(%rip), %xmm5 # -t=r-1.0 ;trash r
+ mulsd %xmm2, %xmm3 # x4
+ addsd .Lcosarray+0x40(%rip), %xmm1 # c5+x2c6
+ addsd .Lcosarray+0x10(%rip), %xmm0 # c2+x2C3
+ addsd .L__real_3ff0000000000000(%rip), %xmm5 # 1 + (-t) ;trash t
+ mulsd %xmm2, %xmm3 # x6
+ mulsd %xmm2, %xmm1 # x2(c5+x2c6)
+ mulsd %xmm2, %xmm0 # x2(c2+x2C3)
+ movsd %xmm2, %xmm4 # copy of x2
+ mulsd .L__real_3fe0000000000000(%rip), %xmm4 # r recalculate
+ addsd .Lcosarray+0x30(%rip), %xmm1 # c4 + x2(c5+x2c6)
+ addsd .Lcosarray(%rip), %xmm0 # c1+x2(c2+x2C3)
+ mulsd %xmm2, %xmm2 # x4 recalculate
+ subsd %xmm4, %xmm5 # (1 + (-t)) - r
+ mulsd %xmm3, %xmm1 # x6(c4 + x2(c5+x2c6))
+ addsd %xmm1, %xmm0 # zc
+	subsd .L__real_3ff0000000000000(%rip), %xmm4	# t recalculate
+ subsd p_temp(%rsp), %xmm5 # ((1 + (-t)) - r) - x*xx
+ mulsd %xmm2, %xmm0 # x4 * zc
+ addsd %xmm5, %xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx)
+ subsd %xmm4, %xmm0 # result - (-t)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 32
+.Ladjust_region:		# positive or negative (0, 1, 2, 3) => (1, 2, 3, 4) => (0, 2, 2, 0)
+# switch (region)
+ add $1, %eax
+ and $2, %eax
+ jz .Lcos_cleanup
+## if the original region 1 or 2 then we negate the result.
+ movsd %xmm0, %xmm2
+ xorpd %xmm0, %xmm0
+ subsd %xmm2, %xmm0
+
+.align 32
+.Lcos_cleanup:
+ add $stack_size, %rsp
+ ret
+
+.align 32
+.Lcos_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
+
+
diff --git a/src/gas/cosf.S b/src/gas/cosf.S
new file mode 100644
index 0000000..43eae9a
--- /dev/null
+++ b/src/gas/cosf.S
@@ -0,0 +1,372 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# An implementation of the cosf function.
+#
+# Prototype:
+#
+#     float cosf(float x);
+#
+# Computes cosf(x).
+# Based on the NAG C implementation.
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
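+#
+# (The scheme matches cos.S: the input is widened to double with cvtss2sd,
+#  reduced with the same piby2_1/piby2_1tail split (or __amd_remainder_piby2
+#  for very large arguments), and a shorter sin/cos polynomial from .Lcsarray
+#  is evaluated; region = npi2 & 3 selects and signs the result.)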
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0 # for alignment
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3FF921FB54442D18: .quad 0x03FF921FB54442D18 # piby2
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_411E848000000000:	.quad 0x415312d000000000	# 5e6 (label kept from the earlier value 0x0411E848000000000 = 5e5)
+ .quad 0
+
+.align 32
+.Lcsarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+
+.text
+.align 32
+.p2align 5,,31
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(cosf)
+#define fname_special _cosf_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ region, 0x50 # pointer to region for amd_remainder_piby2
+.equ r, 0x60 # pointer to r for amd_remainder_piby2
+.equ stack_size, 0x88
+
+.globl fname
+.type fname,@function
+
+fname:
+
+ sub $stack_size, %rsp
+
+## if NaN or inf
+ movd %xmm0, %edx
+ mov $0x07f800000, %eax
+ mov %eax, %r10d
+ and %edx, %r10d
+ cmp %eax, %r10d
+ jz .Lcosf_naninf
+
+ xorpd %xmm2, %xmm2
+ mov %rdx, %r11 # save 1st return value pointer
+
+# GET_BITS_DP64(x, ux);
+# convert input to double.
+ cvtss2sd %xmm0, %xmm0
+
+# get the input value to an integer register.
+ movsd %xmm0,p_temp(%rsp)
+ mov p_temp(%rsp), %rdx # rdx is ux
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff, %r10
+ and %rdx, %r10 # r10 is ax
+
+ mov $1, %r8d # for determining region later on
+ movsd %xmm0, %xmm1 # copy x to xmm1
+
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18, %rax
+ cmp %rax, %r10
+ jg .L__sc_reducec
+
+# *c = cos_piby4(x, 0.0);
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 # x^2
+ xor %eax, %eax
+ mov %r10, %rdx
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+ jmp .L__sc_piby4c
+
+.align 32
+.L__sc_reducec:
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+# xneg = (ax != ux);
+ cmp %r10, %rdx
+## if (xneg) x = -x;
+ jz .Lpositive
+ subsd %xmm0, %xmm2
+ movsd %xmm2, %xmm0
+
+.Lpositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip), %r10
+ jae .Lcosf_reduce_precise
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+# perform taylor series to calc sinx, cosx
+# xmm0=abs(x), xmm1=x
+.align 32
+.Lcosf_piby4:
+#/* How many pi/2 is x a multiple of? */
+# npi2 = (int)(x * twobypi + 0.5);
+
+ movsd %xmm0, %xmm2
+ movsd %xmm0, %xmm4
+
+ mulsd .L__real_3fe45f306dc9c883(%rip), %xmm2 # twobypi
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+
+#/* How many pi/2 is x a multiple of? */
+
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10, %r9
+ shr $52, %r9 # >> EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5, %xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1
+ cvttpd2dq %xmm2, %xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail
+ cvtdq2pd %xmm0, %xmm2 # and back to double
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+
+ mulsd %xmm2, %xmm3 # use piby2_1
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2, %xmm1 # rtail
+ movd %xmm0, %eax
+
+# GET_BITS_DP64(rhead-rtail, uy);
+ movsd %xmm4, %xmm0
+ subsd %xmm1, %xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2
+ movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail
+ movd %xmm0, %rcx # rcx is rhead-rtail
+
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1, %rcx # strip any sign bit
+ shr $53, %rcx # >> EXPSHIFTBITS_DP64 +1
+ sub %rcx, %r9 # expdiff
+
+## if (expdiff > 15)
+ cmp $15, %r9
+ jle .Lexpdiffless15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4, %xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2, %xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2, %xmm5 # npi2 * piby2_2tail
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4, %xmm1 # t - rhead
+ subsd %xmm3, %xmm1 # -rtail
+ subsd %xmm1, %xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4, %xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5, %xmm1
+ subsd %xmm5, %xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiffless15:
+# region = npi2 & 3;
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 #x^2
+ movsd %xmm0, %xmm1
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+
+ cmp $0x03f2, %rcx # if r small.
+ jge .L__sc_piby4c # use taylor series if not
+ cmp $0x03de, %rcx # if r really small.
+ jle .L__rc_small # then cos(r) = 1
+
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jz .Lsinsmall
+# region 1 or 3
+# use a simple polynomial
+# *s = x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip), %xmm3
+ mulsd %xmm1, %xmm3 # * x
+ mulsd %xmm2, %xmm3 # * x^2
+ subsd %xmm3, %xmm1 # xs
+ jmp .L__adjust_region_cos
+
+.align 16
+.Lsinsmall:
+# region 0 or 2
+# cos = 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip), %xmm1 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2
+ subsd %xmm2, %xmm1
+ jmp .L__adjust_region_cos
+
+.align 16
+.L__rc_small: # then sin(r) = r
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jnz .L__adjust_region_cos
+ movsd .L__real_3ff0000000000000(%rip), %xmm1 # cos(r) is a 1
+ jmp .L__adjust_region_cos
+
+
+# done with reducing the argument. Now perform the sin/cos calculations.
+.align 16
+.L__sc_piby4c:
+## if region is 1 or 3 do a sin calc.
+ and %eax, %r8d
+ jz .Lcospiby4
+
+ movsd .Lcsarray+0x30(%rip), %xmm1 # c4
+ movsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm1 # x2c4
+ movsd .Lcsarray+0x10(%rip), %xmm3 # c2
+ mulsd %xmm4, %xmm4 # x4
+ mulsd %xmm2, %xmm3 # x2c2
+ mulsd %xmm0, %xmm2 # x3
+ addsd .Lcsarray+0x20(%rip), %xmm1 # c3 + x2c4
+ mulsd %xmm4, %xmm1 # x4(c3 + x2c4)
+ addsd .Lcsarray(%rip), %xmm3 # c1 + x2c2
+ addsd %xmm3, %xmm1 # c1 + c2x2 + c3x4 + c4x6
+ mulsd %xmm2, %xmm1 # c1x3 + c2x5 + c3x7 + c4x9
+ addsd %xmm0, %xmm1 # x + c1x3 + c2x5 + c3x7 + c4x9
+
+ jmp .L__adjust_region_cos
+
+.align 16
+.Lcospiby4:
+# region 0 or 2 - do a cos calculation
+ movsd .Lcsarray+0x38(%rip), %xmm1 # c4
+ movsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm1 # x2c4
+ movsd .Lcsarray+0x18(%rip), %xmm3 # c2
+ mulsd %xmm4, %xmm4 # x4
+ mulsd %xmm2, %xmm3 # x2c2
+ mulsd %xmm2, %xmm5 # 0.5 * x2
+ addsd .Lcsarray+0x28(%rip), %xmm1 # c3 + x2c4
+ mulsd %xmm4, %xmm1 # x4(c3 + x2c4)
+ addsd .Lcsarray+8(%rip), %xmm3 # c1 + x2c2
+ addsd %xmm3, %xmm1 # c1 + x2c2 + c3x4 + c4x6
+ mulsd %xmm4, %xmm1 # x4(c1 + c2x2 + c3x4 + c4x6)
+
+# -t = rc-1;
+ subsd .L__real_3ff0000000000000(%rip), %xmm5 # 0.5x2 - 1
+ subsd %xmm5, %xmm1 # cos = 1 - 0.5x2 + c1x4 + c2x6 + c3x8 + c4x10
+
+.L__adjust_region_cos:		# xmm1 is cos or sin; relies on previous sections to leave the region number in %eax
+# switch (region)
+ add $1, %eax
+ and $2, %eax
+ jz .L__cos_cleanup
+## if region 1 or 2 then we negate the result.
+ xorpd %xmm2, %xmm2
+ subsd %xmm1, %xmm2
+ movsd %xmm2, %xmm1
+
+.align 16
+.L__cos_cleanup:
+ cvtsd2ss %xmm1, %xmm0
+ add $stack_size, %rsp
+ ret
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lcosf_reduce_precise:
+# /* Reduce abs(x) into range [-pi/4,pi/4] */
+#      __amd_remainder_piby2(ax, &r, &region);
+
+ mov %rdx,p_temp(%rsp) # save ux for use later
+ mov %r10,p_temp1(%rsp) # save ax for use later
+ movd %xmm0, %rdi
+ lea r(%rsp), %rsi
+ lea region(%rsp), %rdx
+ sub $0x020, %rsp
+
+ call __amd_remainder_piby2d2f@PLT
+
+ add $0x020, %rsp
+ mov p_temp(%rsp), %rdx # restore ux for use later
+ mov p_temp1(%rsp), %r10 # restore ax for use later
+ mov $1, %r8d # for determining region later on
+ movsd r(%rsp), %xmm0 # r
+ mov region(%rsp), %eax # region
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x^2
+ movsd %xmm0, %xmm1
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+
+ jmp .L__sc_piby4c
+
+.align 32
+.Lcosf_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
diff --git a/src/gas/exp.S b/src/gas/exp.S
new file mode 100644
index 0000000..153e8a6
--- /dev/null
+++ b/src/gas/exp.S
@@ -0,0 +1,400 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# exp.S
+#
+# An implementation of the exp libm function.
+#
+# Prototype:
+#
+# double exp(double x);
+#
+
+#
+# Algorithm:
+#
+# e^x = 2^(x/ln(2)) = 2^(x*(64/ln(2))/64)
+#
+# x*(64/ln(2)) = n + f, |f| <= 0.5, n is integer
+# n = 64*m + j, 0 <= j < 64
+#
+# e^x = 2^((64*m + j + f)/64)
+# = (2^m) * (2^(j/64)) * 2^(f/64)
+# = (2^m) * (2^(j/64)) * e^(f*(ln(2)/64))
+#
+# f = x*(64/ln(2)) - n
+# r = f*(ln(2)/64) = x - n*(ln(2)/64)
+#
+# e^x = (2^m) * (2^(j/64)) * e^r
+#
+# (2^(j/64)) is precomputed
+#
+# e^r = 1 + r + (r^2)/2! + (r^3)/3! + (r^4)/4! + (r^5)/5! + (r^6)/6!
+# e^r = 1 + q
+#
+# q = r + (r^2)/2! + (r^3)/3! + (r^4)/4! + (r^5)/5! + (r^6)/6!
+#
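+# A hedged C-style sketch of the same scheme (illustration only; the table
+# names paraphrase the data section of this file):
+#
+#   n = (int)(x * 64.0/ln2);                 /* truncated, as cvttpd2dq below */
+#   j = n & 0x3f;   m = n >> 6;
+#   r = (x - n*log2_by_64_head) - n*log2_by_64_tail;
+#   q = r + r*r*(1/2 + r*(1/6 + r*(1/24 + r*(1/120 + r*(1/720)))));
+#   f1 = two_to_jby64_head[j];  f2 = two_to_jby64_tail[j];
+#   result = scale_by_2_to_m(f1*q + f2*q + f2 + f1, m);  /* exponent adjustment */
+#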
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp)
+#define fname_special _exp_special@PLT
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomisd .L__max_exp_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomisd .L__denormal_tiny_threshold(%rip), %xmm0
+ jbe .L__y_is_zero
+
+ # x * (64/ln(2))
+ movapd %xmm0,%xmm1
+ mulsd .L__real_64_by_log2(%rip), %xmm1
+
+ # n = int( x * (64/ln(2)) )
+ cvttpd2dq %xmm1, %xmm2 #xmm2 = (int)n
+ cvtdq2pd %xmm2, %xmm1 #xmm1 = (double)n
+ movd %xmm2, %ecx
+ movapd %xmm1,%xmm2
+ # r1 = x - n * ln(2)/64 head
+ mulsd .L__log2_by_64_mhead(%rip),%xmm1
+
+ #j = n & 0x3f
+ mov $0x3f, %rax
+ and %ecx, %eax #eax = j
+ # m = (n - j) / 64
+ sar $6, %ecx #ecx = m
+
+
+ # r2 = - n * ln(2)/64 tail
+ mulsd .L__log2_by_64_mtail(%rip),%xmm2
+ addsd %xmm1,%xmm0 #xmm0 = r1
+
+ # r1+r2
+ addsd %xmm0, %xmm2 #xmm2 = r
+
+ # q = r + r^2*1/2 + r^3*1/6 + r^4 *1/24 + r^5*1/120 + r^6*1/720
+ # q = r + r*r*(1/2 + r*(1/6+ r*(1/24 + r*(1/120 + r*(1/720)))))
+ movapd .L__real_1_by_720(%rip), %xmm3 #xmm3 = 1/720
+ mulsd %xmm2, %xmm3 #xmm3 = r*1/720
+ movapd .L__real_1_by_6(%rip), %xmm0 #xmm0 = 1/6
+ movapd %xmm2, %xmm1 #xmm1 = r
+ mulsd %xmm2, %xmm0 #xmm0 = r*1/6
+ addsd .L__real_1_by_120(%rip), %xmm3 #xmm3 = 1/120 + (r*1/720)
+ mulsd %xmm2, %xmm1 #xmm1 = r*r
+ addsd .L__real_1_by_2(%rip), %xmm0 #xmm0 = 1/2 + (r*1/6)
+ movapd %xmm1, %xmm4 #xmm4 = r*r
+ mulsd %xmm1, %xmm4 #xmm4 = (r*r) * (r*r)
+ mulsd %xmm2, %xmm3 #xmm3 = r * (1/120 + (r*1/720))
+ mulsd %xmm1, %xmm0 #xmm0 = (r*r)*(1/2 + (r*1/6))
+ addsd .L__real_1_by_24(%rip), %xmm3 #xmm3 = 1/24 + (r * (1/120 + (r*1/720)))
+ addsd %xmm2, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6)))
+ mulsd %xmm4, %xmm3 #xmm3 = ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+ addsd %xmm3, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+
+ # (f)*(q) + f2 + f1
+ cmp $0xfffffc02, %ecx # -1022
+ lea .L__two_to_jby64_table(%rip), %rdx
+ lea .L__two_to_jby64_tail_table(%rip), %r11
+ lea .L__two_to_jby64_head_table(%rip), %r10
+ mulsd (%rdx,%rax,8), %xmm0
+ addsd (%r11,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ jle .L__process_denormal
+.L__process_normal:
+ shl $52, %rcx
+ movd %rcx,%xmm2
+ paddq %xmm2, %xmm0
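+    # m is folded in by adding it to the result's exponent field (multiply by 2^m)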
+ ret
+
+.p2align 4
+.L__process_denormal:
+ jl .L__process_true_denormal
+ ucomisd .L__real_one(%rip), %xmm0
+ jae .L__process_normal
+.L__process_true_denormal:
+ # here ( e^r < 1 and m = -1022 ) or m <= -1023
+ add $1074, %ecx
+ mov $1, %rax
+ shl %cl, %rax
+ movd %rax, %xmm2
+ mulsd %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7ff0000000000000,%rax
+ movd %rax, %xmm1
+ mov $3, %edi
+ jmp fname_special
+
+.p2align 4
+.L__y_is_nan:
+ movapd %xmm0,%xmm1
+ addsd %xmm0,%xmm1
+ mov $1, %edi
+ jmp fname_special
+
+.p2align 4
+.L__y_is_zero:
+ ucomisd .L__min_exp_arg(%rip),%xmm0
+ jbe .L__return_zero
+ movapd .L__real_smallest_denormal(%rip), %xmm0
+ ret
+
+.p2align 4
+.L__return_zero:
+ pxor %xmm1,%xmm1
+ mov $2, %edi
+ jmp fname_special
+
+.data
+.align 16
+.L__max_exp_arg: .quad 0x40862e42fefa39ef
+.L__denormal_tiny_threshold: .quad 0xc0874046dfefd9d0
+.L__min_exp_arg: .quad 0xc0874910d52d3051
+.L__real_64_by_log2: .quad 0x40571547652b82fe # 64/ln(2)
+
+.align 16
+.L__log2_by_64_mhead: .quad 0xbf862e42fefa0000
+.L__log2_by_64_mtail: .quad 0xbd1cf79abc9e3b39
+.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720
+.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+.L__real_one: .quad 0x3ff0000000000000
+.L__real_smallest_denormal: .quad 0x0000000000000001
+
+
+.align 16
+.L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+.align 16
+.L__two_to_jby64_head_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a30000000
+ .quad 0x3ff059b0d0000000
+ .quad 0x3ff0874510000000
+ .quad 0x3ff0b55860000000
+ .quad 0x3ff0e3ec30000000
+ .quad 0x3ff11301d0000000
+ .quad 0x3ff1429aa0000000
+ .quad 0x3ff172b830000000
+ .quad 0x3ff1a35be0000000
+ .quad 0x3ff1d48730000000
+ .quad 0x3ff2063b80000000
+ .quad 0x3ff2387a60000000
+ .quad 0x3ff26b4560000000
+ .quad 0x3ff29e9df0000000
+ .quad 0x3ff2d285a0000000
+ .quad 0x3ff306fe00000000
+ .quad 0x3ff33c08b0000000
+ .quad 0x3ff371a730000000
+ .quad 0x3ff3a7db30000000
+ .quad 0x3ff3dea640000000
+ .quad 0x3ff4160a20000000
+ .quad 0x3ff44e0860000000
+ .quad 0x3ff486a2b0000000
+ .quad 0x3ff4bfdad0000000
+ .quad 0x3ff4f9b270000000
+ .quad 0x3ff5342b50000000
+ .quad 0x3ff56f4730000000
+ .quad 0x3ff5ab07d0000000
+ .quad 0x3ff5e76f10000000
+ .quad 0x3ff6247eb0000000
+ .quad 0x3ff6623880000000
+ .quad 0x3ff6a09e60000000
+ .quad 0x3ff6dfb230000000
+ .quad 0x3ff71f75e0000000
+ .quad 0x3ff75feb50000000
+ .quad 0x3ff7a11470000000
+ .quad 0x3ff7e2f330000000
+ .quad 0x3ff8258990000000
+ .quad 0x3ff868d990000000
+ .quad 0x3ff8ace540000000
+ .quad 0x3ff8f1ae90000000
+ .quad 0x3ff93737b0000000
+ .quad 0x3ff97d8290000000
+ .quad 0x3ff9c49180000000
+ .quad 0x3ffa0c6670000000
+ .quad 0x3ffa5503b0000000
+ .quad 0x3ffa9e6b50000000
+ .quad 0x3ffae89f90000000
+ .quad 0x3ffb33a2b0000000
+ .quad 0x3ffb7f76f0000000
+ .quad 0x3ffbcc1e90000000
+ .quad 0x3ffc199bd0000000
+ .quad 0x3ffc67f120000000
+ .quad 0x3ffcb720d0000000
+ .quad 0x3ffd072d40000000
+ .quad 0x3ffd5818d0000000
+ .quad 0x3ffda9e600000000
+ .quad 0x3ffdfc9730000000
+ .quad 0x3ffe502ee0000000
+ .quad 0x3ffea4afa0000000
+ .quad 0x3ffefa1be0000000
+ .quad 0x3fff507650000000
+ .quad 0x3fffa7c180000000
+
+.align 16
+.L__two_to_jby64_tail_table:
+ .quad 0x0000000000000000
+ .quad 0x3e6cef00c1dcdef9
+ .quad 0x3e48ac2ba1d73e2a
+ .quad 0x3e60eb37901186be
+ .quad 0x3e69f3121ec53172
+ .quad 0x3e469e8d10103a17
+ .quad 0x3df25b50a4ebbf1a
+ .quad 0x3e6d525bbf668203
+ .quad 0x3e68faa2f5b9bef9
+ .quad 0x3e66df96ea796d31
+ .quad 0x3e368b9aa7805b80
+ .quad 0x3e60c519ac771dd6
+ .quad 0x3e6ceac470cd83f5
+ .quad 0x3e5789f37495e99c
+ .quad 0x3e547f7b84b09745
+ .quad 0x3e5b900c2d002475
+ .quad 0x3e64636e2a5bd1ab
+ .quad 0x3e4320b7fa64e430
+ .quad 0x3e5ceaa72a9c5154
+ .quad 0x3e53967fdba86f24
+ .quad 0x3e682468446b6824
+ .quad 0x3e3f72e29f84325b
+ .quad 0x3e18624b40c4dbd0
+ .quad 0x3e5704f3404f068e
+ .quad 0x3e54d8a89c750e5e
+ .quad 0x3e5a74b29ab4cf62
+ .quad 0x3e5a753e077c2a0f
+ .quad 0x3e5ad49f699bb2c0
+ .quad 0x3e6a90a852b19260
+ .quad 0x3e56b48521ba6f93
+ .quad 0x3e0d2ac258f87d03
+ .quad 0x3e42a91124893ecf
+ .quad 0x3e59fcef32422cbe
+ .quad 0x3e68ca345de441c5
+ .quad 0x3e61d8bee7ba46e1
+ .quad 0x3e59099f22fdba6a
+ .quad 0x3e4f580c36bea881
+ .quad 0x3e5b3d398841740a
+ .quad 0x3e62999c25159f11
+ .quad 0x3e668925d901c83b
+ .quad 0x3e415506dadd3e2a
+ .quad 0x3e622aee6c57304e
+ .quad 0x3e29b8bc9e8a0387
+ .quad 0x3e6fbc9c9f173d24
+ .quad 0x3e451f8480e3e235
+ .quad 0x3e66bbcac96535b5
+ .quad 0x3e41f12ae45a1224
+ .quad 0x3e55e7f6fd0fac90
+ .quad 0x3e62b5a75abd0e69
+ .quad 0x3e609e2bf5ed7fa1
+ .quad 0x3e47daf237553d84
+ .quad 0x3e12f074891ee83d
+ .quad 0x3e6b0aa538444196
+ .quad 0x3e6cafa29694426f
+ .quad 0x3e69df20d22a0797
+ .quad 0x3e640f12f71a1e45
+ .quad 0x3e69f7490e4bb40b
+ .quad 0x3e4ed9942b84600d
+ .quad 0x3e4bdcdaf5cb4656
+ .quad 0x3e5e2cffd89cf44c
+ .quad 0x3e452486cc2c7b9d
+ .quad 0x3e6cc2b44eee3fa4
+ .quad 0x3e66dc8a80ce9f09
+ .quad 0x3e39e90d82e90a7e
+
+#endif
diff --git a/src/gas/exp10.S b/src/gas/exp10.S
new file mode 100644
index 0000000..009bbe0
--- /dev/null
+++ b/src/gas/exp10.S
@@ -0,0 +1,366 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp10)
+#define fname_special _exp10_special@PLT
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomisd .L__max_exp10_arg(%rip), %xmm0
+ jae .L__y_is_inf
+ jp .L__y_is_nan
+ ucomisd .L__min_exp10_arg(%rip), %xmm0
+ jbe .L__y_is_zero
+
+ # x * (64/log10(2))
+ movapd %xmm0,%xmm1
+ mulsd .L__real_64_by_log10of2(%rip), %xmm1
+
+ # n = int( x * (64/log10(2)) )
+ cvttpd2dq %xmm1, %xmm2 #xmm2 = (int)n
+ cvtdq2pd %xmm2, %xmm1 #xmm1 = (double)n
+ movd %xmm2, %ecx
+ movapd %xmm1,%xmm2
+ # r1 = x - n * log10(2)/64 head
+ mulsd .L__log10of2_by_64_mhead(%rip),%xmm1
+
+ #j = n & 0x3f
+ mov $0x3f, %rax
+ and %ecx, %eax #eax = j
+ # m = (n - j) / 64
+ sar $6, %ecx #ecx = m
+
+ # r2 = - n * log10(2)/64 tail
+ mulsd .L__log10of2_by_64_mtail(%rip),%xmm2 #xmm2 = r2
+ addsd %xmm1,%xmm0 #xmm0 = r1
+
+ # r1 *= ln10;
+ # r2 *= ln10;
+ mulsd .L__ln10(%rip),%xmm0
+ mulsd .L__ln10(%rip),%xmm2
+
+ # r1+r2
+ addsd %xmm0, %xmm2 #xmm2 = r
+
+ # q = r + r^2*1/2 + r^3*1/6 + r^4 *1/24 + r^5*1/120 + r^6*1/720
+ # q = r + r*r*(1/2 + r*(1/6+ r*(1/24 + r*(1/120 + r*(1/720)))))
+ movapd .L__real_1_by_720(%rip), %xmm3 #xmm3 = 1/720
+ mulsd %xmm2, %xmm3 #xmm3 = r*1/720
+ movapd .L__real_1_by_6(%rip), %xmm0 #xmm0 = 1/6
+ movapd %xmm2, %xmm1 #xmm1 = r
+ mulsd %xmm2, %xmm0 #xmm0 = r*1/6
+ addsd .L__real_1_by_120(%rip), %xmm3 #xmm3 = 1/120 + (r*1/720)
+ mulsd %xmm2, %xmm1 #xmm1 = r*r
+ addsd .L__real_1_by_2(%rip), %xmm0 #xmm0 = 1/2 + (r*1/6)
+ movapd %xmm1, %xmm4 #xmm4 = r*r
+ mulsd %xmm1, %xmm4 #xmm4 = (r*r) * (r*r)
+ mulsd %xmm2, %xmm3 #xmm3 = r * (1/120 + (r*1/720))
+ mulsd %xmm1, %xmm0 #xmm0 = (r*r)*(1/2 + (r*1/6))
+ addsd .L__real_1_by_24(%rip), %xmm3 #xmm3 = 1/24 + (r * (1/120 + (r*1/720)))
+ addsd %xmm2, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6)))
+ mulsd %xmm4, %xmm3 #xmm3 = ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+ addsd %xmm3, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+
+ # (f)*(q) + f2 + f1
+ cmp $0xfffffc02, %ecx # -1022
+ lea .L__two_to_jby64_table(%rip), %rdx
+ lea .L__two_to_jby64_tail_table(%rip), %r11
+ lea .L__two_to_jby64_head_table(%rip), %r10
+ mulsd (%rdx,%rax,8), %xmm0
+ addsd (%r11,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ jle .L__process_denormal
+.L__process_normal:
+ shl $52, %rcx
+ movd %rcx,%xmm2
+ paddq %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__process_denormal:
+ jl .L__process_true_denormal
+ ucomisd .L__real_one(%rip), %xmm0
+ jae .L__process_normal
+.L__process_true_denormal:
+ # here ( e^r < 1 and m = -1022 ) or m <= -1023
+ add $1074, %ecx
+ mov $1, %rax
+ shl %cl, %rax
+ movd %rax, %xmm2
+ mulsd %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7ff0000000000000,%rax
+ movd %rax, %xmm1
+ mov $3, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.p2align 4
+.L__y_is_nan:
+ movapd %xmm0,%xmm1
+ addsd %xmm0,%xmm1
+ mov $1, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.p2align 4
+.L__y_is_zero:
+ pxor %xmm1,%xmm1
+ mov $2, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
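+# Hedged summary of the reduction above (illustration only):
+#   10^x = 2^(x*log2(10)) = 2^m * 2^(j/64) * e^r
+#   with n = (int)(x*64/log10(2)),  j = n & 0x3f,  m = n >> 6,
+#   and  r = (x - n*log10(2)/64) * ln(10).
+# 2^(j/64) comes from the head/tail tables below; 2^m is applied through the
+# exponent field, with the denormal path covering m <= -1022.
+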
+.data
+.align 16
+.L__max_exp10_arg: .quad 0x40734413509f79ff
+.L__min_exp10_arg: .quad 0xc07434e6420f4374
+.L__real_64_by_log10of2: .quad 0x406A934F0979A371 # 64/log10(2)
+.L__ln10: .quad 0x40026BB1BBB55516
+
+.align 16
+.L__log10of2_by_64_mhead: .quad 0xbF73441350000000
+.L__log10of2_by_64_mtail: .quad 0xbda3ef3fde623e25
+.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720
+.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+.L__real_one: .quad 0x3ff0000000000000
+
+.align 16
+.L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+.align 16
+.L__two_to_jby64_head_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a30000000
+ .quad 0x3ff059b0d0000000
+ .quad 0x3ff0874510000000
+ .quad 0x3ff0b55860000000
+ .quad 0x3ff0e3ec30000000
+ .quad 0x3ff11301d0000000
+ .quad 0x3ff1429aa0000000
+ .quad 0x3ff172b830000000
+ .quad 0x3ff1a35be0000000
+ .quad 0x3ff1d48730000000
+ .quad 0x3ff2063b80000000
+ .quad 0x3ff2387a60000000
+ .quad 0x3ff26b4560000000
+ .quad 0x3ff29e9df0000000
+ .quad 0x3ff2d285a0000000
+ .quad 0x3ff306fe00000000
+ .quad 0x3ff33c08b0000000
+ .quad 0x3ff371a730000000
+ .quad 0x3ff3a7db30000000
+ .quad 0x3ff3dea640000000
+ .quad 0x3ff4160a20000000
+ .quad 0x3ff44e0860000000
+ .quad 0x3ff486a2b0000000
+ .quad 0x3ff4bfdad0000000
+ .quad 0x3ff4f9b270000000
+ .quad 0x3ff5342b50000000
+ .quad 0x3ff56f4730000000
+ .quad 0x3ff5ab07d0000000
+ .quad 0x3ff5e76f10000000
+ .quad 0x3ff6247eb0000000
+ .quad 0x3ff6623880000000
+ .quad 0x3ff6a09e60000000
+ .quad 0x3ff6dfb230000000
+ .quad 0x3ff71f75e0000000
+ .quad 0x3ff75feb50000000
+ .quad 0x3ff7a11470000000
+ .quad 0x3ff7e2f330000000
+ .quad 0x3ff8258990000000
+ .quad 0x3ff868d990000000
+ .quad 0x3ff8ace540000000
+ .quad 0x3ff8f1ae90000000
+ .quad 0x3ff93737b0000000
+ .quad 0x3ff97d8290000000
+ .quad 0x3ff9c49180000000
+ .quad 0x3ffa0c6670000000
+ .quad 0x3ffa5503b0000000
+ .quad 0x3ffa9e6b50000000
+ .quad 0x3ffae89f90000000
+ .quad 0x3ffb33a2b0000000
+ .quad 0x3ffb7f76f0000000
+ .quad 0x3ffbcc1e90000000
+ .quad 0x3ffc199bd0000000
+ .quad 0x3ffc67f120000000
+ .quad 0x3ffcb720d0000000
+ .quad 0x3ffd072d40000000
+ .quad 0x3ffd5818d0000000
+ .quad 0x3ffda9e600000000
+ .quad 0x3ffdfc9730000000
+ .quad 0x3ffe502ee0000000
+ .quad 0x3ffea4afa0000000
+ .quad 0x3ffefa1be0000000
+ .quad 0x3fff507650000000
+ .quad 0x3fffa7c180000000
+
+.align 16
+.L__two_to_jby64_tail_table:
+ .quad 0x0000000000000000
+ .quad 0x3e6cef00c1dcdef9
+ .quad 0x3e48ac2ba1d73e2a
+ .quad 0x3e60eb37901186be
+ .quad 0x3e69f3121ec53172
+ .quad 0x3e469e8d10103a17
+ .quad 0x3df25b50a4ebbf1a
+ .quad 0x3e6d525bbf668203
+ .quad 0x3e68faa2f5b9bef9
+ .quad 0x3e66df96ea796d31
+ .quad 0x3e368b9aa7805b80
+ .quad 0x3e60c519ac771dd6
+ .quad 0x3e6ceac470cd83f5
+ .quad 0x3e5789f37495e99c
+ .quad 0x3e547f7b84b09745
+ .quad 0x3e5b900c2d002475
+ .quad 0x3e64636e2a5bd1ab
+ .quad 0x3e4320b7fa64e430
+ .quad 0x3e5ceaa72a9c5154
+ .quad 0x3e53967fdba86f24
+ .quad 0x3e682468446b6824
+ .quad 0x3e3f72e29f84325b
+ .quad 0x3e18624b40c4dbd0
+ .quad 0x3e5704f3404f068e
+ .quad 0x3e54d8a89c750e5e
+ .quad 0x3e5a74b29ab4cf62
+ .quad 0x3e5a753e077c2a0f
+ .quad 0x3e5ad49f699bb2c0
+ .quad 0x3e6a90a852b19260
+ .quad 0x3e56b48521ba6f93
+ .quad 0x3e0d2ac258f87d03
+ .quad 0x3e42a91124893ecf
+ .quad 0x3e59fcef32422cbe
+ .quad 0x3e68ca345de441c5
+ .quad 0x3e61d8bee7ba46e1
+ .quad 0x3e59099f22fdba6a
+ .quad 0x3e4f580c36bea881
+ .quad 0x3e5b3d398841740a
+ .quad 0x3e62999c25159f11
+ .quad 0x3e668925d901c83b
+ .quad 0x3e415506dadd3e2a
+ .quad 0x3e622aee6c57304e
+ .quad 0x3e29b8bc9e8a0387
+ .quad 0x3e6fbc9c9f173d24
+ .quad 0x3e451f8480e3e235
+ .quad 0x3e66bbcac96535b5
+ .quad 0x3e41f12ae45a1224
+ .quad 0x3e55e7f6fd0fac90
+ .quad 0x3e62b5a75abd0e69
+ .quad 0x3e609e2bf5ed7fa1
+ .quad 0x3e47daf237553d84
+ .quad 0x3e12f074891ee83d
+ .quad 0x3e6b0aa538444196
+ .quad 0x3e6cafa29694426f
+ .quad 0x3e69df20d22a0797
+ .quad 0x3e640f12f71a1e45
+ .quad 0x3e69f7490e4bb40b
+ .quad 0x3e4ed9942b84600d
+ .quad 0x3e4bdcdaf5cb4656
+ .quad 0x3e5e2cffd89cf44c
+ .quad 0x3e452486cc2c7b9d
+ .quad 0x3e6cc2b44eee3fa4
+ .quad 0x3e66dc8a80ce9f09
+ .quad 0x3e39e90d82e90a7e
+
+
+
diff --git a/src/gas/exp10f.S b/src/gas/exp10f.S
new file mode 100644
index 0000000..da805e2
--- /dev/null
+++ b/src/gas/exp10f.S
@@ -0,0 +1,191 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp10f)
+#define fname_special _exp10f_special@PLT
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomiss .L__max_exp_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomiss .L__min_exp_arg(%rip), %xmm0
+ jb .L__y_is_zero
+
+ cvtps2pd %xmm0, %xmm0 #xmm0 = (double)x
+
+ # x * (64/log10of(2))
+       movapd      %xmm0,%xmm3                                #xmm3 = (double)x
+       mulsd       .L__real_64_by_log10of2(%rip), %xmm3       #xmm3 = x * (64/log10(2))
+
+ # n = int( x * (64/log10of(2)) )
+ cvtpd2dq %xmm3, %xmm4 #xmm4 = (int)n
+ cvtdq2pd %xmm4, %xmm2 #xmm2 = (double)n
+
+ # r = x - n * ln(2)/64
+ # r *= ln(10)
+ mulsd .L__real_log10of2_by_64(%rip),%xmm2 #xmm2 = n * log10of(2)/64
+ movd %xmm4, %ecx #ecx = n
+ subsd %xmm2, %xmm0 #xmm0 = r
+ mulsd .L__real_ln10(%rip),%xmm0 #xmm0 = r = r*ln10
+ movapd %xmm0, %xmm1 #xmm1 = r
+
+       # q = r + r*r*(1/2 + r*1/6)
+ movapd .L__real_1_by_6(%rip), %xmm3
+ mulsd %xmm0, %xmm3 #xmm3 = 1/6 * r
+ mulsd %xmm1, %xmm0 #xmm0 = r * r
+ addsd .L__real_1_by_2(%rip), %xmm3 #xmm3 = 1/2 + (1/6 * r)
+ mulsd %xmm3, %xmm0 #xmm0 = r*r*(1/2 + (1/6 * r))
+ addsd %xmm1, %xmm0 #xmm0 = r+r*r*(1/2 + (1/6 * r))
+
+ #j = n & 0x3f
+ mov $0x3f, %rax #rax = 0x3f
+ and %ecx, %eax #eax = j = n & 0x3f
+
+ # f + (f*q)
+ lea L__two_to_jby64_table(%rip), %r10
+ mulsd (%r10,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ .p2align 4
+ # m = (n - j) / 64
+ psrad $6,%xmm4
+ psllq $52,%xmm4
+ paddq %xmm0, %xmm4
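+    # (the shifted m lands in the exponent field of the packed double,
+    #  scaling the result by 2^m before the final convert to float)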
+ cvtpd2ps %xmm4, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_zero:
+ pxor %xmm1, %xmm1 #return value in xmm1,input in xmm0 before calling
+ mov $2, %edi #code in edi
+ #call fname_special
+ pxor %xmm0,%xmm0#remove this if calling fname special
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7f800000,%edx
+ movd %edx, %xmm1
+ mov $3, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0#remove this if calling fname special
+ ret
+
+.p2align 4
+.L__y_is_nan:
+ movaps %xmm0,%xmm1
+ addss %xmm1,%xmm1
+ mov $1, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if calling fname special
+ ret
+
+.data
+.align 16
+.L__max_exp_arg: .long 0x421A209B
+.L__min_exp_arg: .long 0xC23369F4
+.L__real_64_by_log10of2: .quad 0x406A934F0979A371 # 64/log10(2)
+.L__real_log10of2_by_64: .quad 0x3F734413509F79FF # log10of2_by_64
+.L__real_ln10: .quad 0x40026BB1BBB55516 # ln(10)
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+
+.align 16
+.type L__two_to_jby64_table, @object
+.size L__two_to_jby64_table, 512
+L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+
diff --git a/src/gas/exp2.S b/src/gas/exp2.S
new file mode 100644
index 0000000..8e556d4
--- /dev/null
+++ b/src/gas/exp2.S
@@ -0,0 +1,355 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp2)
+#define fname_special _exp2_special@PLT
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomisd .L__max_exp2_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomisd .L__min_exp2_arg(%rip), %xmm0
+ jbe .L__y_is_zero
+
+ # x * (64)
+ movapd %xmm0,%xmm2
+ mulsd .L__real_64(%rip), %xmm2
+
+ # n = int( x * (64))
+ cvttpd2dq %xmm2, %xmm1 #xmm1 = (int)n
+ cvtdq2pd %xmm1, %xmm2 #xmm2 = (double)n
+ movd %xmm1, %ecx
+
+ # r = x - n * 1/64
+ #r *= ln2;
+ mulsd .L__one_by_64(%rip),%xmm2
+ addsd %xmm0,%xmm2 #xmm2 = r
+ mulsd .L__ln_2(%rip),%xmm2
+
+ #j = n & 0x3f
+ mov $0x3f, %rax
+ and %ecx, %eax #eax = j
+ # m = (n - j) / 64
+ sar $6, %ecx #ecx = m
+
+ # q = r + r^2*1/2 + r^3*1/6 + r^4 *1/24 + r^5*1/120 + r^6*1/720
+ # q = r + r*r*(1/2 + r*(1/6+ r*(1/24 + r*(1/120 + r*(1/720)))))
+ movapd .L__real_1_by_720(%rip), %xmm3 #xmm3 = 1/720
+ mulsd %xmm2, %xmm3 #xmm3 = r*1/720
+ movapd .L__real_1_by_6(%rip), %xmm0 #xmm0 = 1/6
+ movapd %xmm2, %xmm1 #xmm1 = r
+ mulsd %xmm2, %xmm0 #xmm0 = r*1/6
+ addsd .L__real_1_by_120(%rip), %xmm3 #xmm3 = 1/120 + (r*1/720)
+ mulsd %xmm2, %xmm1 #xmm1 = r*r
+ addsd .L__real_1_by_2(%rip), %xmm0 #xmm0 = 1/2 + (r*1/6)
+ movapd %xmm1, %xmm4 #xmm4 = r*r
+ mulsd %xmm1, %xmm4 #xmm4 = (r*r) * (r*r)
+ mulsd %xmm2, %xmm3 #xmm3 = r * (1/120 + (r*1/720))
+ mulsd %xmm1, %xmm0 #xmm0 = (r*r)*(1/2 + (r*1/6))
+ addsd .L__real_1_by_24(%rip), %xmm3 #xmm3 = 1/24 + (r * (1/120 + (r*1/720)))
+ addsd %xmm2, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6)))
+ mulsd %xmm4, %xmm3 #xmm3 = ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+ addsd %xmm3, %xmm0 #xmm0 = r + ((r*r)*(1/2 + (r*1/6))) + ((r*r) * (r*r)) * (1/24 + (r * (1/120 + (r*1/720))))
+
+ # (f)*(q) + f2 + f1
+ cmp $0xfffffc02, %ecx # -1022
+ lea .L__two_to_jby64_table(%rip), %rdx
+ lea .L__two_to_jby64_tail_table(%rip), %r11
+ lea .L__two_to_jby64_head_table(%rip), %r10
+ mulsd (%rdx,%rax,8), %xmm0
+ addsd (%r11,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ jle .L__process_denormal
+.L__process_normal:
+ shl $52, %rcx
+ movd %rcx,%xmm2
+ paddq %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__process_denormal:
+ jl .L__process_true_denormal
+ ucomisd .L__real_one(%rip), %xmm0
+ jae .L__process_normal
+.L__process_true_denormal:
+ # here ( e^r < 1 and m = -1022 ) or m <= -1023
+ add $1074, %ecx
+ mov $1, %rax
+ shl %cl, %rax
+ movd %rax, %xmm2
+ mulsd %xmm2, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7ff0000000000000,%rax
+ movd %rax, %xmm1
+ mov $3, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.p2align 4
+.L__y_is_nan:
+ movapd %xmm0,%xmm1
+ addsd %xmm0,%xmm1
+ mov $1, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.p2align 4
+.L__y_is_zero:
+ pxor %xmm1,%xmm1
+ mov $2, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if call is made
+ ret
+
+.data
+.align 16
+.L__max_exp2_arg: .quad 0x4090000000000000
+.L__min_exp2_arg: .quad 0xc090c80000000000
+.L__real_64: .quad 0x4050000000000000 # 64
+.L__ln_2: .quad 0x3FE62E42FEFA39EF
+.L__one_by_64:			.quad 0xbF90000000000000	# -1/64 (stored negated; the mul-then-add computes r = x - n/64)
+
+.align 16
+.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720
+.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+.L__real_one: .quad 0x3ff0000000000000
+
+.align 16
+.L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+.align 16
+.L__two_to_jby64_head_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a30000000
+ .quad 0x3ff059b0d0000000
+ .quad 0x3ff0874510000000
+ .quad 0x3ff0b55860000000
+ .quad 0x3ff0e3ec30000000
+ .quad 0x3ff11301d0000000
+ .quad 0x3ff1429aa0000000
+ .quad 0x3ff172b830000000
+ .quad 0x3ff1a35be0000000
+ .quad 0x3ff1d48730000000
+ .quad 0x3ff2063b80000000
+ .quad 0x3ff2387a60000000
+ .quad 0x3ff26b4560000000
+ .quad 0x3ff29e9df0000000
+ .quad 0x3ff2d285a0000000
+ .quad 0x3ff306fe00000000
+ .quad 0x3ff33c08b0000000
+ .quad 0x3ff371a730000000
+ .quad 0x3ff3a7db30000000
+ .quad 0x3ff3dea640000000
+ .quad 0x3ff4160a20000000
+ .quad 0x3ff44e0860000000
+ .quad 0x3ff486a2b0000000
+ .quad 0x3ff4bfdad0000000
+ .quad 0x3ff4f9b270000000
+ .quad 0x3ff5342b50000000
+ .quad 0x3ff56f4730000000
+ .quad 0x3ff5ab07d0000000
+ .quad 0x3ff5e76f10000000
+ .quad 0x3ff6247eb0000000
+ .quad 0x3ff6623880000000
+ .quad 0x3ff6a09e60000000
+ .quad 0x3ff6dfb230000000
+ .quad 0x3ff71f75e0000000
+ .quad 0x3ff75feb50000000
+ .quad 0x3ff7a11470000000
+ .quad 0x3ff7e2f330000000
+ .quad 0x3ff8258990000000
+ .quad 0x3ff868d990000000
+ .quad 0x3ff8ace540000000
+ .quad 0x3ff8f1ae90000000
+ .quad 0x3ff93737b0000000
+ .quad 0x3ff97d8290000000
+ .quad 0x3ff9c49180000000
+ .quad 0x3ffa0c6670000000
+ .quad 0x3ffa5503b0000000
+ .quad 0x3ffa9e6b50000000
+ .quad 0x3ffae89f90000000
+ .quad 0x3ffb33a2b0000000
+ .quad 0x3ffb7f76f0000000
+ .quad 0x3ffbcc1e90000000
+ .quad 0x3ffc199bd0000000
+ .quad 0x3ffc67f120000000
+ .quad 0x3ffcb720d0000000
+ .quad 0x3ffd072d40000000
+ .quad 0x3ffd5818d0000000
+ .quad 0x3ffda9e600000000
+ .quad 0x3ffdfc9730000000
+ .quad 0x3ffe502ee0000000
+ .quad 0x3ffea4afa0000000
+ .quad 0x3ffefa1be0000000
+ .quad 0x3fff507650000000
+ .quad 0x3fffa7c180000000
+
+.align 16
+.L__two_to_jby64_tail_table:
+ .quad 0x0000000000000000
+ .quad 0x3e6cef00c1dcdef9
+ .quad 0x3e48ac2ba1d73e2a
+ .quad 0x3e60eb37901186be
+ .quad 0x3e69f3121ec53172
+ .quad 0x3e469e8d10103a17
+ .quad 0x3df25b50a4ebbf1a
+ .quad 0x3e6d525bbf668203
+ .quad 0x3e68faa2f5b9bef9
+ .quad 0x3e66df96ea796d31
+ .quad 0x3e368b9aa7805b80
+ .quad 0x3e60c519ac771dd6
+ .quad 0x3e6ceac470cd83f5
+ .quad 0x3e5789f37495e99c
+ .quad 0x3e547f7b84b09745
+ .quad 0x3e5b900c2d002475
+ .quad 0x3e64636e2a5bd1ab
+ .quad 0x3e4320b7fa64e430
+ .quad 0x3e5ceaa72a9c5154
+ .quad 0x3e53967fdba86f24
+ .quad 0x3e682468446b6824
+ .quad 0x3e3f72e29f84325b
+ .quad 0x3e18624b40c4dbd0
+ .quad 0x3e5704f3404f068e
+ .quad 0x3e54d8a89c750e5e
+ .quad 0x3e5a74b29ab4cf62
+ .quad 0x3e5a753e077c2a0f
+ .quad 0x3e5ad49f699bb2c0
+ .quad 0x3e6a90a852b19260
+ .quad 0x3e56b48521ba6f93
+ .quad 0x3e0d2ac258f87d03
+ .quad 0x3e42a91124893ecf
+ .quad 0x3e59fcef32422cbe
+ .quad 0x3e68ca345de441c5
+ .quad 0x3e61d8bee7ba46e1
+ .quad 0x3e59099f22fdba6a
+ .quad 0x3e4f580c36bea881
+ .quad 0x3e5b3d398841740a
+ .quad 0x3e62999c25159f11
+ .quad 0x3e668925d901c83b
+ .quad 0x3e415506dadd3e2a
+ .quad 0x3e622aee6c57304e
+ .quad 0x3e29b8bc9e8a0387
+ .quad 0x3e6fbc9c9f173d24
+ .quad 0x3e451f8480e3e235
+ .quad 0x3e66bbcac96535b5
+ .quad 0x3e41f12ae45a1224
+ .quad 0x3e55e7f6fd0fac90
+ .quad 0x3e62b5a75abd0e69
+ .quad 0x3e609e2bf5ed7fa1
+ .quad 0x3e47daf237553d84
+ .quad 0x3e12f074891ee83d
+ .quad 0x3e6b0aa538444196
+ .quad 0x3e6cafa29694426f
+ .quad 0x3e69df20d22a0797
+ .quad 0x3e640f12f71a1e45
+ .quad 0x3e69f7490e4bb40b
+ .quad 0x3e4ed9942b84600d
+ .quad 0x3e4bdcdaf5cb4656
+ .quad 0x3e5e2cffd89cf44c
+ .quad 0x3e452486cc2c7b9d
+ .quad 0x3e6cc2b44eee3fa4
+ .quad 0x3e66dc8a80ce9f09
+ .quad 0x3e39e90d82e90a7e
+
+
diff --git a/src/gas/exp2f.S b/src/gas/exp2f.S
new file mode 100644
index 0000000..78c50e0
--- /dev/null
+++ b/src/gas/exp2f.S
@@ -0,0 +1,193 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(exp2f)
+#define fname_special _exp2f_special@PLT
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomiss .L__max_exp2_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomiss .L__min_exp2_arg(%rip), %xmm0
+ jb .L__y_is_zero
+
+ cvtps2pd %xmm0, %xmm0 #xmm0 = (double)x
+
+ # x * (64)
+ movapd %xmm0,%xmm3 #xmm3 = (double)x
+ #mulsd .L__sixtyfour(%rip), %xmm3 #xmm3 = x * (64)
+ paddq .L__sixtyfour(%rip), %xmm3 #xmm3 = x * (64)
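+    # (adding 0x0060000000000000 bumps the biased exponent by 6, i.e. scales a
+    #  normal double by 2^6 = 64; only the integer part of the product is used)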
+
+       # n = int( x * 64 )
+ cvtpd2dq %xmm3, %xmm4 #xmm4 = (int)n
+ cvtdq2pd %xmm4, %xmm2 #xmm2 = (double)n
+
+ # r = x - n * 1/64
+ # r *= ln(2)
+ mulsd .L__one_by_64(%rip),%xmm2 #xmm2 = n * 1/64
+ movd %xmm4, %ecx #ecx = n
+ subsd %xmm2, %xmm0 #xmm0 = r
+ mulsd .L__ln2(%rip),%xmm0 #xmm0 = r = r*ln(2)
+ movapd %xmm0, %xmm1 #xmm1 = r
+
+ # q
+ movsd .L__real_1_by_6(%rip), %xmm3
+ mulsd %xmm0, %xmm3 #xmm3 = 1/6 * r
+ mulsd %xmm1, %xmm0 #xmm0 = r * r
+ addsd .L__real_1_by_2(%rip), %xmm3 #xmm3 = 1/2 + (1/6 * r)
+ mulsd %xmm3, %xmm0 #xmm0 = r*r*(1/2 + (1/6 * r))
+ addsd %xmm1, %xmm0 #xmm0 = r+r*r*(1/2 + (1/6 * r))
+
+ #j = n & 0x3f
+ mov $0x3f, %rax #rax = 0x3f
+ and %ecx, %eax #eax = j = n & 0x3f
+
+ # f + (f*q)
+ lea L__two_to_jby64_table(%rip), %r10
+ mulsd (%r10,%rax,8), %xmm0
+ addsd (%r10,%rax,8), %xmm0
+
+ .p2align 4
+ # m = (n - j) / 64
+ psrad $6,%xmm4
+ psllq $52,%xmm4
+ paddq %xmm0, %xmm4
+ cvtpd2ps %xmm4, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_zero:
+ pxor %xmm1, %xmm1 #return value in xmm1,input in xmm0 before calling
+ mov $2, %edi #code in edi
+ #call fname_special
+ pxor %xmm0,%xmm0#remove this if calling fname special
+ ret
+
+.p2align 4
+.L__y_is_inf:
+ mov $0x7f800000,%edx
+ movd %edx, %xmm1
+ mov $3, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0#remove this if calling fname special
+ ret
+
+.p2align 4
+.L__y_is_nan:
+ movaps %xmm0,%xmm1
+ addss %xmm1,%xmm1
+ mov $1, %edi
+ #call fname_special
+ movdqa %xmm1,%xmm0 #remove this if calling fname special
+ ret
+
+.data
+.align 16
+.L__max_exp2_arg: .long 0x43000000
+.L__min_exp2_arg: .long 0xc3150000
+.align 16
+.L__sixtyfour:			.quad 0x0060000000000000	# 6 << 52 (exponent-field increment used by paddq to scale by 64)
+.L__one_by_64: .quad 0x3F90000000000000 # 1/64
+.L__ln2: .quad 0x3FE62E42FEFA39EF # ln(2)
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+
+.align 16
+.type L__two_to_jby64_table, @object
+.size L__two_to_jby64_table, 512
+L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+
diff --git a/src/gas/expf.S b/src/gas/expf.S
new file mode 100644
index 0000000..cefa608
--- /dev/null
+++ b/src/gas/expf.S
@@ -0,0 +1,201 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# expf.S
+#
+# An implementation of the expf libm function.
+#
+# Prototype:
+#
+# float expf(float x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in exp.S
+#
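+# (Hedged note: the single-precision path below uses one 2^(j/64) table and a
+#  degree-3 polynomial q = r + r^2/2 + r^3/6, which is enough for float accuracy.)
+#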
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(expf)
+#define fname_special _expf_special@PLT
+
+.text
+.p2align 4
+.globl fname
+.type fname,@function
+fname:
+ ucomiss .L__max_exp_arg(%rip), %xmm0
+ ja .L__y_is_inf
+ jp .L__y_is_nan
+ ucomiss .L__min_exp_arg(%rip), %xmm0
+ jb .L__y_is_zero
+
+ cvtps2pd %xmm0, %xmm0 #xmm0 = (double)x
+
+ # x * (64/ln(2))
+       movapd      %xmm0,%xmm3                                #xmm3 = (double)x
+       mulsd       .L__real_64_by_log2(%rip), %xmm3           #xmm3 = x * (64/ln(2))
+
+ # n = int( x * (64/ln(2)) )
+ cvtpd2dq %xmm3, %xmm4 #xmm4 = (int)n
+ cvtdq2pd %xmm4, %xmm2 #xmm2 = (double)n
+
+ # r = x - n * ln(2)/64
+ mulsd .L__real_log2_by_64(%rip),%xmm2 #xmm2 = n * ln(2)/64
+ movd %xmm4, %ecx #ecx = n
+ subsd %xmm2, %xmm0 #xmm0 = r
+ movapd %xmm0, %xmm1 #xmm1 = r
+
+ # q
+ movsd .L__real_1_by_6(%rip), %xmm3
+ mulsd %xmm0, %xmm3 #xmm3 = 1/6 * r
+ mulsd %xmm1, %xmm0 #xmm0 = r * r
+ addsd .L__real_1_by_2(%rip), %xmm3 #xmm3 = 1/2 + (1/6 * r)
+ mulsd %xmm3, %xmm0 #xmm0 = r*r*(1/2 + (1/6 * r))
+ addsd %xmm1, %xmm0 #xmm0 = r+r*r*(1/2 + (1/6 * r))
+
+ #j = n & 0x3f
+ mov $0x3f, %rax #rax = 0x3f
+ and %ecx, %eax #eax = j = n & 0x3f
+ # m = (n - j) / 64
+ sar $6, %ecx #ecx = m
+ shl $52, %rcx
+
+ # (f)*(1+q)
+ lea L__two_to_jby64_table(%rip), %r10
+ movsd (%r10,%rax,8), %xmm2
+ mulsd %xmm2, %xmm0
+ addsd %xmm2, %xmm0
+
+ movd %rcx, %xmm1
+ paddq %xmm0, %xmm1
+ cvtpd2ps %xmm1, %xmm0
+ ret
+
+.p2align 4
+.L__y_is_zero:
+
+ pxor %xmm1, %xmm1 #return value in xmm1,input in xmm0 before calling
+ mov $2, %edi #code in edi
+ jmp fname_special
+
+.p2align 4
+.L__y_is_inf:
+
+ mov $0x7f800000,%edx
+ movd %edx, %xmm1
+ mov $3, %edi
+ jmp fname_special
+
+.p2align 4
+.L__y_is_nan:
+ movaps %xmm0,%xmm1
+ addss %xmm1,%xmm1
+ mov $1, %edi
+ jmp fname_special
+
+.data
+.align 16
+.L__max_exp_arg: .long 0x42B17218
+.L__min_exp_arg: .long 0xC2CE8ED0
+.L__real_64_by_log2: .quad 0x40571547652b82fe # 64/ln(2)
+.L__real_log2_by_64: .quad 0x3f862e42fefa39ef # log2_by_64
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+
+.align 16
+.type L__two_to_jby64_table, @object
+.size L__two_to_jby64_table, 512
+L__two_to_jby64_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a3e778061
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0874518759bc8
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff0e3ec32d3d1a2
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff1429aaea92de0
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1a35beb6fcb75
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2063b88628cd6
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff26b4565e27cdd
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff2d285a6e4030b
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff33c08b26416ff
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3a7db34e59ff7
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff4160a21f72e2a
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff486a2b5c13cd0
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff4f9b2769d2ca7
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff56f4736b527da
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff5e76f15ad2148
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6623882552225
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff6dfb23c651a2f
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff75feb564267c9
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff7e2f336cf4e62
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff868d99b4492ed
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff8f1ae99157736
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff97d829fde4e50
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa0c667b5de565
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffa9e6b5579fdbf
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb33a2b84f15fb
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffbcc1e904bc1d2
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffc67f12e57d14b
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd072d4a07897c
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffda9e603db3285
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffe502ee78b3ff6
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3ffefa1bee615a27
+ .quad 0x3fff50765b6e4540
+ .quad 0x3fffa7c1819e90d8
+
+
+#endif
diff --git a/src/gas/expm1.S b/src/gas/expm1.S
new file mode 100644
index 0000000..dff043c
--- /dev/null
+++ b/src/gas/expm1.S
@@ -0,0 +1,359 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(expm1)
+
+#ifdef __ELF__
+ .section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .p2align 4
+.globl fname
+ .type fname, @function
+
+fname:
+
+ ucomisd .L__max_expm1_arg(%rip),%xmm0 #check if(x > 709.8)
+ ja .L__Max_Arg
+ jp .L__Max_Arg
+ ucomisd .L__min_expm1_arg(%rip),%xmm0 #if(x < -37.42994775023704)
+ jb .L__Min_Arg
+ ucomisd .L__log_OneMinus_OneByFour(%rip),%xmm0
+ jbe .L__Normal_Flow
+ ucomisd .L__log_OnePlus_OneByFour(%rip),%xmm0
+ jb .L__Small_Arg
+
+ .p2align 4
+.L__Normal_Flow:
+ movapd %xmm0,%xmm1 #xmm1 = x
+ mulsd .L__thirtyTwo_by_ln2(%rip),%xmm1 #xmm1 = x*thirtyTwo_by_ln2
+ ucomisd .L__zero(%rip),%xmm1 #check if temp < 0.0
+ jae .L__Add_Point_Five
+ subsd .L__point_Five(%rip),%xmm1
+ jmp .L__next
+.L__Add_Point_Five:
+ addsd .L__point_Five(%rip),%xmm1 #xmm1 = temp +/- 0.5
+.L__next:
+ cvttpd2dq %xmm1,%xmm2 #xmm2 = (int)n
+ cvtdq2pd %xmm2,%xmm1 #xmm1 = (double)n
+ movapd %xmm2,%xmm3 #xmm3 = (int)n
+ psrad $5,%xmm2 #xmm2 = m
+ pslld $27,%xmm3
+ psrld $27,%xmm3 #xmm3 = j
+ movd %xmm3,%edx #edx = j
+ movd %xmm2,%ecx #ecx = m
+
+ movlhps %xmm1,%xmm1 #xmm1 = n,n
+ mulpd .L__Ln2By32_MinusTrailLead(%rip),%xmm1
+ movapd %xmm0,%xmm2
+ subsd %xmm1,%xmm2 #xmm2 = r1
+ psrldq $8,%xmm1 #xmm1 = r2
+ movapd %xmm2,%xmm3 #xmm3 = r1
+ addsd %xmm1,%xmm3 #xmm3 = r
+ #q = r*(r*(A1.f64 + r*(A2.f64 + r*(A3.f64 + r*(A4.f64 + r*(A5.f64))))));
+ movapd %xmm3,%xmm4
+ mulsd .L__A5(%rip),%xmm4
+ addsd .L__A4(%rip),%xmm4
+ mulsd %xmm3,%xmm4
+ addsd .L__A3(%rip),%xmm4
+ mulsd %xmm3,%xmm4
+ addsd .L__A2(%rip),%xmm4
+ mulsd %xmm3,%xmm4
+ addsd .L__A1(%rip),%xmm4
+ mulsd %xmm3,%xmm4
+ mulsd %xmm4,%xmm3 #xmm3 = q
+
+ shl $4,%edx
+ lea S_lead_and_trail_table(%rip),%rax
+ movdqa (%rax,%rdx,1),%xmm5 #xmm5 = S_T,S_L
+
+ #p = (r2+q) + r1;
+ addsd %xmm3,%xmm1
+ addsd %xmm1,%xmm2 #xmm2 = p
+
+ #s = S_L.f64 + S_T.f64;
+ movhlps %xmm5,%xmm4 #xmm4 = S_T
+ movapd %xmm4,%xmm3 #xmm3 = S_T
+ addsd %xmm5,%xmm3 #xmm3 = s
+
+ cmp $52,%ecx #check m > 52
+ jg .L__M_Above_52
+ cmp $-7,%ecx #check if m < -7
+ jl .L__M_Below_Minus7
+ #(-8 < m) && (m < 53)
+ movapd %xmm2,%xmm3 #xmm3 = p
+ addsd .L__One(%rip),%xmm3 #xmm3 = 1+p
+ mulsd %xmm4,%xmm3 #xmm3 = S_T.f64 *(1+p)
+ mulsd %xmm5,%xmm2 #xmm2 = S_L*p
+ addsd %xmm3,%xmm2 #xmm2 = (S_L.f64*p+ S_T.f64 *(1+p))
+ mov $1023,%edx
+ sub %ecx,%edx #edx = twopmm
+ shl $52,%rdx
+ movd %rdx,%xmm1 #xmm1 = twopmm
+ subsd %xmm1,%xmm5 #xmm5 = S_L.f64 - twopmm.f64
+ addsd %xmm5,%xmm2
+ shl $52,%rcx
+ movd %rcx,%xmm0 #xmm0 = twopm
+ paddq %xmm2,%xmm0 #xmm0 = twopm *(xmm2)
+ ret
+
+ .p2align 4
+.L__M_Above_52:
+ cmp $1024,%ecx #check if m = 1024
+ je .L__M_Equals_1024
+ #twopm.f64 * (S_L.f64 + (s*p+(S_T.f64 - twopmm.f64)));// 2^-m should not be calculated if m>105
+ mov $1023,%edx
+ sub %ecx,%edx #edx = twopmm
+ shl $52,%rdx
+ movd %rdx,%xmm1 #xmm1 = twopmm
+ subsd %xmm1,%xmm4 #xmm4 = S_T - twopmm
+ mulsd %xmm3,%xmm2 #xmm2 = s*p
+ addsd %xmm4,%xmm2
+ addsd %xmm5,%xmm2
+ shl $52,%rcx
+ movd %rcx,%xmm0 #xmm0 = twopm
+ paddq %xmm2,%xmm0
+ ret
+
+ .p2align 4
+.L__M_Below_Minus7:
+ #twopm.f64 * (S_L.f64 + (s*p + S_T.f64)) - 1;
+ mulsd %xmm3,%xmm2 #xmm2 = s*p
+ addsd %xmm4,%xmm2 #xmm2 = (s*p + S_T.f64)
+ addsd %xmm5,%xmm2 #xmm2 = (S_L.f64 + (s*p + S_T.f64))
+ shl $52,%rcx
+ movd %rcx,%xmm0 #xmm0 = twopm
+ paddq %xmm2,%xmm0 #xmm0 = twopm *(xmm2)
+ subsd .L__One(%rip),%xmm0
+ ret
+
+ .p2align 4
+.L__M_Equals_1024:
+ mov $0x4000000000000000,%rax #1024 at exponent
+ mulsd %xmm3,%xmm2 #xmm2 = s*p
+ addsd %xmm4,%xmm2 #xmm2 = (s*p) + S_T
+ addsd %xmm5,%xmm2 #xmm2 = S_L + ((s*p) + S_T)
+ movd %rax,%xmm1 #xmm1 = twopm
+ paddq %xmm2,%xmm1
+ movd %xmm1,%rax
+ mov $0x7FF0000000000000,%rcx
+ and %rcx,%rax
+ cmp %rcx,%rax #check if we reached inf
+ je .L__return_Inf
+ movapd %xmm1,%xmm0
+ ret
+
+ .p2align 4
+.L__Small_Arg:
+ movapd %xmm0,%xmm1
+ psllq $1,%xmm1
+ psrlq $1,%xmm1 #xmm1 = abs(x)
+ ucomisd .L__Five_Pont_FiveEMinus17(%rip),%xmm1
+ jb .L__VeryTinyArg
+ mov $0x01E0000000000000,%rax #30 in exponents place
+ #u = (twop30.f64 * x + x) - twop30.f64 * x;
+ movd %rax,%xmm1
+ paddq %xmm0,%xmm1 #xmm1 = twop30.f64 * x
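+    # (adding 30 to the exponent field scales x by 2^30; x is a normal double
+    #  here since the tiny-argument case was branched off above)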
+ movapd %xmm1,%xmm2
+ addsd %xmm0,%xmm2 #xmm2 = (twop30.f64 * x + x)
+ subsd %xmm1,%xmm2 #xmm2 = u
+ movapd %xmm0,%xmm1
+ subsd %xmm2,%xmm1 #xmm1 = v = x-u
+ movapd %xmm2,%xmm3 #xmm3 = u
+ mulsd %xmm2,%xmm3 #xmm3 = u*u
+ mulsd .L__point_Five(%rip),%xmm3 #xmm3 = y = u*u*0.5
+ #z = v * (x + u) * 0.5;
+ movapd %xmm0,%xmm4
+ addsd %xmm2,%xmm4
+ mulsd %xmm1,%xmm4
+ mulsd .L__point_Five(%rip),%xmm4 #xmm4 = z
+
+       #q = x*x*x*(B1.f64 + x*(B2.f64 + x*(B3.f64 + x*(B4.f64 + x*(B5.f64 + x*(B6.f64 + x*(B7.f64 + x*(B8.f64 + x*(B9.f64)))))))));
+ movapd %xmm0,%xmm5
+ mulsd .L__B9(%rip),%xmm5
+ addsd .L__B8(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B7(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B6(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B5(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B4(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B3(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B2(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ addsd .L__B1(%rip),%xmm5
+ mulsd %xmm0,%xmm5
+ mulsd %xmm0,%xmm5
+ mulsd %xmm0,%xmm5 #xmm5 = q
+
+ ucomisd .L__TwopM7(%rip),%xmm3
+ jb .L__returnNext
+ addsd %xmm4,%xmm1 #xmm1 = v+z
+ addsd %xmm5,%xmm1 #xmm1 = q+(v+z)
+ addsd %xmm3,%xmm2 #xmm2 = u+y
+ addsd %xmm2,%xmm1
+ movapd %xmm1,%xmm0
+ ret
+ .p2align 4
+.L__returnNext:
+ addsd %xmm5,%xmm4 #xmm4 = q +z
+ addsd %xmm4,%xmm3 #xmm3 = y+(q+z)
+ addsd %xmm3,%xmm0
+ ret
+
+ .p2align 4
+.L__VeryTinyArg:
+        #(twop100.f64 * x + xabs.f64) * twopm100.f64;
+ mov $0x0640000000000000,%rax #100 at exponent's place
+ movd %rax,%xmm2
+ paddq %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ psubq %xmm2,%xmm0
+ ret
+
+
+ .p2align 4
+.L__Max_Arg:
+ movd %xmm0,%rcx
+ mov $0x7ff0000000000000,%rax
+ cmp %rax,%rcx #x is either Nan or Inf
+ jb .L__return_Inf
+ mov $0x000fffffffffffff,%rdx #check if x is Nan
+ and %rdx,%rcx
+ jne .L__Nan
+.L__return_Inf:
+ movd %rax,%xmm0
+ #call error_handler
+ ret
+ .p2align 4
+.L__Nan:
+ addsd %xmm0,%xmm0
+ ret
+
+ .p2align 4
+.L__Min_Arg:
+ mov $0xBFF0000000000000,%rax #return -1
+ #call error handler
+ movd %rax,%xmm0
+ ret
+
+.data
+.align 16
+.L__max_expm1_arg:
+ .quad 0x40862E6666666666
+.L__min_expm1_arg:
+ .quad 0xC042B708872320E1
+.L__log_OneMinus_OneByFour:
+ .quad 0xBFD269621134DB93
+.L__log_OnePlus_OneByFour:
+ .quad 0x3FCC8FF7C79A9A22
+.L__thirtyTwo_by_ln2:
+ .quad 0x40471547652B82FE
+.L__zero:
+ .quad 0x0000000000000000
+.L__point_Five:
+ .quad 0x3FE0000000000000
+
+.align 16
+.L__Ln2By32_MinusTrailLead:
+ .octa 0xBD8473DE6AF278ED3F962E42FEF00000
+.L__A5:
+ .quad 0x3F56C1728D739765
+.L__A4:
+ .quad 0x3F811115B7AA905E
+.L__A3:
+ .quad 0x3FA5555555545D4E
+.L__A2:
+ .quad 0x3FC5555555548F7C
+.L__A1:
+ .quad 0x3FE0000000000000
+.L__One:
+ .quad 0x3FF0000000000000
+
+.align 16
+# .type two_to_jby32_table, @object
+# .size two_to_jby32_table, 512
+S_lead_and_trail_table:
+ .octa 0x00000000000000003FF0000000000000
+ .octa 0x3D0A1D73E2A475B43FF059B0D3158540
+ .octa 0x3CEEC5317256E3083FF0B5586CF98900
+ .octa 0x3CF0A4EBBF1AED933FF11301D0125B40
+ .octa 0x3D0D6E6FBE4628763FF172B83C7D5140
+ .octa 0x3D053C02DC0144C83FF1D4873168B980
+ .octa 0x3D0C3360FD6D8E0B3FF2387A6E756200
+ .octa 0x3D009612E8AFAD123FF29E9DF51FDEC0
+ .octa 0x3CF52DE8D5A463063FF306FE0A31B700
+ .octa 0x3CE54E28AA05E8A93FF371A7373AA9C0
+ .octa 0x3D011ADA0911F09F3FF3DEA64C123400
+ .octa 0x3D068189B7A04EF83FF44E0860618900
+ .octa 0x3D038EA1CBD7F6213FF4BFDAD5362A00
+ .octa 0x3CBDF0A83C49D86A3FF5342B569D4F80
+ .octa 0x3D04AC64980A8C8F3FF5AB07DD485400
+ .octa 0x3CD2C7C3E81BF4B73FF6247EB03A5580
+ .octa 0x3CE921165F626CDD3FF6A09E667F3BC0
+ .octa 0x3D09EE91B87977853FF71F75E8EC5F40
+ .octa 0x3CDB5F54408FDB373FF7A11473EB0180
+ .octa 0x3CF28ACF88AFAB353FF82589994CCE00
+ .octa 0x3CFB5BA7C55A192D3FF8ACE5422AA0C0
+ .octa 0x3D027A280E1F92A03FF93737B0CDC5C0
+ .octa 0x3CF01C7C46B071F33FF9C49182A3F080
+ .octa 0x3CFC8B424491CAF83FFA5503B23E2540
+ .octa 0x3D06AF439A68BB993FFAE89F995AD380
+ .octa 0x3CDBAA9EC206AD4F3FFB7F76F2FB5E40
+ .octa 0x3CFC2220CB12A0923FFC199BDD855280
+ .octa 0x3D048A81E5E8F4A53FFCB720DCEF9040
+ .octa 0x3CDC976816BAD9B83FFD5818DCFBA480
+ .octa 0x3CFEB968CAC39ED33FFDFC97337B9B40
+ .octa 0x3CF9858F73A18F5E3FFEA4AFA2A490C0
+ .octa 0x3C99D3E12DD8A18B3FFF50765B6E4540
+
+.align 16
+.L__Five_Pont_FiveEMinus17:
+ .quad 0x3C90000000000000
+.L__B9:
+ .quad 0x3E5A2836AA646B96
+.L__B8:
+ .quad 0x3E928295484734EA
+.L__B7:
+ .quad 0x3EC71E14BFE3DB59
+.L__B6:
+ .quad 0x3EFA019F635825C4
+.L__B5:
+ .quad 0x3F2A01A01159DD2D
+.L__B4:
+ .quad 0x3F56C16C16CE14C6
+.L__B3:
+ .quad 0x3F8111111111A9F3
+.L__B2:
+ .quad 0x3FA55555555554B6
+.L__B1:
+ .quad 0x3FC5555555555549
+.L__TwopM7:
+ .quad 0x3F80000000000000
diff --git a/src/gas/expm1f.S b/src/gas/expm1f.S
new file mode 100644
index 0000000..6e7ca03
--- /dev/null
+++ b/src/gas/expm1f.S
@@ -0,0 +1,323 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(expm1f)
+#define fname_special _expm1f_special@PLT
+
+#ifdef __ELF__
+ .section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .p2align 4
+.globl fname
+ .type fname, @function
+
+fname:
+ ucomiss .L__max_expm1_arg(%rip),%xmm0 ##if(x > max_expm1_arg)
+ ja .L__Max_Arg
+ jp .L__Max_Arg
+ ucomiss .L__log_OnePlus_OneByFour(%rip),%xmm0 ##if(x < log_OnePlus_OneByFour)
+ jae .L__Normal_Flow
+ ucomiss .L__log_OneMinus_OneByFour(%rip),%xmm0 ##if(x > log_OneMinus_OneByFour)
+ ja .L__Small_Arg
+ ucomiss .L__min_expm1_arg(%rip),%xmm0 ##if(x < min_expm1_arg)
+ jb .L__Min_Arg
+
+ .p2align 4
+.L__Normal_Flow:
+ movaps %xmm0,%xmm1 #xmm1 = x
+ mulss .L__thirtyTwo_by_ln2(%rip),%xmm1 #xmm1 = x*thirtyTwo_by_ln2
+ movd %xmm1,%eax #eax = x*thirtyTwo_by_ln2
+ and $0x80000000,%eax #get the sign of x*thirtyTwo_by_ln2
+ or $0x3F000000,%eax #make +/- 0.5
+ movd %eax,%xmm2 #xmm2 = +/- 0.5
+ addss %xmm2,%xmm1 #xmm1 = (x*32/ln2) +/- 0.5
+ cvttps2dq %xmm1,%xmm2 #xmm2 = n = (int)(temp)
+ mov $0x0000001f,%edx
+ movd %edx,%xmm1
+ andps %xmm2,%xmm1 #xmm1 = j
+ movd %xmm2,%ecx #ecx = n
+ sarl $5, %ecx #ecx = m = n >> 5
+ #xor %rdx,%rdx #make it zeros, to be used for address
+ movd %xmm1,%edx #edx = j
+ lea S_lead_and_trail_table(%rip),%rax
+ movsd (%rax,%rdx,8),%xmm3 #xmm3 = S_T,S_L
+ punpckldq %xmm2,%xmm1 #xmm1 = n,j
+ psubd %xmm1,%xmm2 #xmm2 = n1
+ punpcklqdq %xmm2,%xmm1 #xmm1 = n1,n,j
+ cvtdq2ps %xmm1,%xmm1 #xmm1 = (float)(n1,n,j)
+
+ #r2 = -(n*ln2_by_ThirtyTwo_trail);
+ #r1 = (x-n1*ln2_by_ThirtyTwo_lead) - j*ln2_by_ThirtyTwo_lead;
+ mulps .L__Ln2By32_LeadTrailLead(%rip),%xmm1
+ movhlps %xmm1,%xmm2 #xmm2 = n1*ln2/32lead
+ movaps %xmm0,%xmm4 #xmm4 = x
+ subss %xmm2,%xmm4 #xmm4 = x - n1*ln2/32lead
+ subss %xmm1,%xmm4 #xmm4 = r1
+ psrldq $4,%xmm1 #xmm1 = -r2 should take care of sign later
+
+ #r = r1 + r2;
+ movaps %xmm4,%xmm7 #xmm7 = r1
+ subss %xmm1,%xmm4 #xmm4 = r = r1-(-r2) = r1 + r2
+
+ #q = r*r*(B1+r*(B2));
+ movaps %xmm4,%xmm6 #xmm6 = r
+ mulss .L__B2_f(%rip),%xmm6 #xmm6 = r * B2
+ addss .L__B1_f(%rip),%xmm6 #xmm6 = B1 + (r * B2)
+ mulss %xmm4,%xmm6
+ mulss %xmm4,%xmm6 #xmm6 = q
+
+ #p = (r2+q) + r1;
+ subss %xmm1,%xmm6
+ addss %xmm7,%xmm6 #xmm6 = p
+
+ #s = S_L.f32 + S_T.f32;
+ movdqa %xmm3,%xmm2 #xmm2 = S_T,S_L
+ psrldq $4,%xmm2 #xmm2 = S_T
+ movaps %xmm2,%xmm5 #xmm5 = S_T
+ addss %xmm3,%xmm2 #xmm2 = s
+
+ cmp $0xfffffff9,%ecx #Check m < -7
+ jl .L__M_Below_Minus7
+ cmp $23,%ecx #Check m > 23
+ jg .L__M_Above_23
+ # -8 < m < 24
+ #twopm.f32 * ((S_L.f32 - twopmm.f32) + (S_L.f32*p+ S_T.f32 *(1+p)));
+ movaps %xmm3,%xmm2 #xmm2 = S_L
+ mulss %xmm6,%xmm2 #xmm2 = S_L * p
+ addss .L__One_f(%rip),%xmm6 #xmm6 = 1+p
+ mulss %xmm5,%xmm6 #xmm6 = S_T *(1+p)
+ addss %xmm6,%xmm2 #xmm2 = (S_L.f32*p+ S_T.f32 *(1+p))
+ mov $127,%eax
+ sub %ecx,%eax #eax = 127 - m
+ shl $23,%eax #eax = 2^-m
+ movd %eax,%xmm1
+ subss %xmm1,%xmm3 #xmm3 = (S_L.f32 - twopmm.f32)
+ addss %xmm3,%xmm2 #xmm2 = ((S_L.f32 - twopmm.f32) + (S_L.f32*p+ S_T.f32 *(1+p)))
+ shl $23,%ecx
+ movd %ecx,%xmm0
+ paddd %xmm2,%xmm0
+ ret
+
+ .p2align 4
+.L__M_Below_Minus7:
+ #twopm.f32 * (S_L.f32 + (s*p + S_T.f32)) - 1;
+ mulss %xmm6,%xmm2 #xmm2 = s*p
+ addss %xmm5,%xmm2 #xmm2 = s*p + S_T
+ addss %xmm3,%xmm2 #xmm2 = (S_L.f32 + (s*p + S_T.f32))
+ shl $23,%ecx
+ movd %ecx,%xmm0
+ paddd %xmm2,%xmm0
+ subss .L__One_f(%rip),%xmm0
+ ret
+
+ .p2align 4
+.L__M_Above_23:
+ #twopm.f32 * (S_L.f32 + (s*p+(S_T.f32 - twopmm.f32)));
+        cmp     $0x00000080,%ecx        #Check if m == 128
+ je .L__M_Equals_128
+ cmp $47,%ecx #Check m > 47
+ ja .L__M_Above_47
+ mov $127,%eax
+ sub %ecx,%eax #eax = 127 - m
+ shl $23,%eax #eax = 2^-m
+ movd %eax,%xmm1
+ subss %xmm1,%xmm5 #xmm5 = S_T.f32 - twopmm.f32
+
+ .p2align 4
+.L__M_Above_47:
+ shl $23,%ecx
+ mulss %xmm6,%xmm2 #xmm2 = s*p
+ addss %xmm5,%xmm2
+ addss %xmm3,%xmm2
+ movd %ecx,%xmm0
+ paddd %xmm2,%xmm0
+ ret
+
+ .p2align 4
+.L__M_Equals_128:
+ mov $0x3f800000,%ecx #127 at exponent
+ mulss %xmm6,%xmm2 #xmm2 = s*p
+ addss %xmm5,%xmm2 #xmm2 = s*p + S_T
+ addss %xmm3,%xmm2 #xmm2 = (S_L.f32 + (s*p + S_T.f32))
+ movd %ecx,%xmm1 #127
+ paddd %xmm2,%xmm1 #2^127*(S_L.f32 + (s*p + S_T.f32))
+ mov $0x00800000,%ecx #multiply with one more 2
+ movd %ecx,%xmm2
+ paddd %xmm2,%xmm1
+ movd %xmm1,%ecx
+ and $0x7f800000,%ecx #check if we reached +inf
+ cmp $0x7f800000,%ecx
+ je .L__Overflow
+ movdqa %xmm1,%xmm0
+ ret
+
+ .p2align 4
+.L__Small_Arg:
+ movd %xmm0,%eax
+ and $0x7fffffff,%eax #eax = abs(x)
+ cmp $0x33000000,%eax #check abs(x) < 2^-25
+ jl .L__VeryTiny_Arg
+ #log(1-1/4) < x < log(1+1/4)
+ #q = x*x*x*(A1 + x*(A2 + x*(A3 + x*(A4 + x*(A5)))));
+ movdqa %xmm0,%xmm1
+ mulss .L__A5_f(%rip),%xmm1
+ addss .L__A4_f(%rip),%xmm1
+ mulss %xmm0,%xmm1
+ addss .L__A3_f(%rip),%xmm1
+ mulss %xmm0,%xmm1
+ addss .L__A2_f(%rip),%xmm1
+ mulss %xmm0,%xmm1
+ addss .L__A1_f(%rip),%xmm1
+ mulss %xmm0,%xmm1
+ mulss %xmm0,%xmm1
+ mulss %xmm0,%xmm1
+ cvtps2pd %xmm0,%xmm2
+ movdqa %xmm2,%xmm0
+ mulsd %xmm0,%xmm2
+ mulsd .L__PointFive(%rip),%xmm2
+ addsd %xmm2,%xmm0
+ cvtps2pd %xmm1,%xmm2
+ addsd %xmm0,%xmm2
+ cvtpd2ps %xmm2,%xmm0
+ ret
+
+ .p2align 4
+.L__Min_Arg:
+ mov $0xBF800000,%eax
+ #call handle_error
+ movd %eax,%xmm0
+ ret
+
+ .p2align 4
+.L__Max_Arg:
+ movd %xmm0,%eax
+ and $0x7fffffff,%eax #eax = abs(x)
+ cmp $0x7f800000,%eax #check for Nan
+ jae .L__Nan
+.L__Overflow:
+ mov $0x7f800000,%eax
+ #call handle_error
+ movd %eax,%xmm0
+ ret
+.L__Nan:
+ and $0x007fffff,%eax
+ je .L__Overflow
+ addss %xmm0,%xmm0
+ ret
+
+ .p2align 4
+.L__VeryTiny_Arg:
+ #((twopm.f32 * x + xabs.f32) * twopmm.f32);
+ movd %eax, %xmm1 #xmm1 = abs(x)
+ mov $0x32000000, %eax #100 at exponent's place
+ movd %eax, %xmm2
+ paddd %xmm2, %xmm0
+ addss %xmm1, %xmm0
+ psubd %xmm2, %xmm0
+ ret
+
+.data
+.align 16
+.type S_lead_and_trail_table, @object
+.size S_lead_and_trail_table, 256
+S_lead_and_trail_table:
+ .quad 0x000000003F800000
+ .quad 0x355315853F82CD80
+ .quad 0x34D9F3123F85AAC0
+ .quad 0x35E8092E3F889800
+ .quad 0x3471F5463F8B95C0
+ .quad 0x36E62D173F8EA400
+ .quad 0x361B9D593F91C3C0
+ .quad 0x36BEA3FC3F94F4C0
+ .quad 0x36C146373F9837C0
+ .quad 0x36E6E7553F9B8D00
+ .quad 0x36C982473F9EF500
+ .quad 0x34C0C3123FA27040
+ .quad 0x36354D8B3FA5FEC0
+ .quad 0x3655A7543FA9A140
+ .quad 0x36FBA90B3FAD5800
+ .quad 0x36D6074B3FB123C0
+ .quad 0x36CCCFE73FB504C0
+ .quad 0x36BD1D8C3FB8FB80
+ .quad 0x368E7D603FBD0880
+ .quad 0x35CCA6673FC12C40
+ .quad 0x36A845543FC56700
+ .quad 0x36F619B93FC9B980
+ .quad 0x35C151F83FCE2480
+ .quad 0x366C8F893FD2A800
+ .quad 0x36F32B5A3FD744C0
+ .quad 0x36DE5F6C3FDBFB80
+    .quad 0x367761553FE0CCC0
+ .quad 0x355CEF903FE5B900
+    .quad 0x355CFBA53FEAC0C0
+ .quad 0x36E66F733FEFE480
+ .quad 0x36F454923FF52540
+ .quad 0x36CB6DC93FFA8380
+
+.align 16
+.L__Ln2By32_LeadTrailLead:
+ .octa 0x333FBE8E3CB17200333FBE8E3CB17200
+
+.L__max_expm1_arg:
+ .long 0x42B19999
+.L__log_OnePlus_OneByFour:
+ .long 0x3E647FBF
+
+.L__log_OneMinus_OneByFour:
+ .long 0xBE934B11
+
+.L__min_expm1_arg:
+ .long 0xC18AA122
+
+.L__thirtyTwo_by_ln2:
+ .long 0x4238AA3B
+
+.align 16
+.L__B2_f:
+ .long 0x3E2AAAEC
+.L__B1_f:
+ .long 0x3F000044
+.L__One_f:
+ .long 0x3F800000
+.L__PointFive:
+ .quad 0x3FE0000000000000
+
+.align 16
+.L__A1_f:
+ .long 0x3E2AAAAA
+.L__A2_f:
+ .long 0x3D2AAAA0
+.L__A3_f:
+ .long 0x3C0889FF
+.L__A4_f:
+ .long 0x3AB64DE5
+.L__A5_f:
+ .long 0x394AB327
+
+
+
+
+
diff --git a/src/gas/fabs.S b/src/gas/fabs.S
new file mode 100644
index 0000000..a436d0f
--- /dev/null
+++ b/src/gas/fabs.S
@@ -0,0 +1,63 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fabs.S
+#
+# An implementation of the fabs libm function.
+#
+# Prototype:
+#
+# double fabs(double x);
+#
+
+#
+# Algorithm: AND the Most Significant Bit of the
+# double precision number with 0 to get the
+# floating point absolute.
+#
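+#
+# For reference only, a minimal C sketch of the same bit trick
+# (the helper name is illustrative and not part of this library;
+# assumes IEEE-754 doubles):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#   static double fabs_sketch(double x) {
+#       uint64_t bits;
+#       memcpy(&bits, &x, sizeof bits);   /* reinterpret the double */
+#       bits &= 0x7FFFFFFFFFFFFFFFULL;    /* clear the sign bit     */
+#       memcpy(&x, &bits, sizeof x);
+#       return x;
+#   }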
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fabs)
+#define fname_special _fabs_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ #input is in xmm0, which contains the final result also.
+ andpd .L__fabs_and_mask(%rip), %xmm0 # <result> latency = 3
+ ret
+
+
+.align 16
+.L__fabs_and_mask: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0x0
+
+
diff --git a/src/gas/fabsf.S b/src/gas/fabsf.S
new file mode 100644
index 0000000..8a6ea27
--- /dev/null
+++ b/src/gas/fabsf.S
@@ -0,0 +1,67 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fabsf.S
+#
+# An implementation of the fabsf libm function.
+#
+# Prototype:
+#
+# float fabsf(float x);
+#
+
+#
+# Algorithm: AND the Most Significant Bit of the
+# single precision number with 0 to get the
+# floating point absolute.
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fabsf)
+#define fname_special _fabsf_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ #input is in xmm0, which contains the final result also.
+ andps .L__fabsf_and_mask(%rip), %xmm0 # <result> latency = 3
+ ret
+
+
+.align 16
+.L__fabsf_and_mask: .long 0x7FFFFFFF
+ .long 0x0
+ .quad 0x0
+
+
+
+
+
diff --git a/src/gas/fdim.S b/src/gas/fdim.S
new file mode 100644
index 0000000..14e382f
--- /dev/null
+++ b/src/gas/fdim.S
@@ -0,0 +1,63 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fdim.S
+#
+# An implementation of the fdim libm function.
+#
+# The fdim functions determine the positive difference between their arguments
+#
+# x - y if x > y
+# +0 if x <= y
+#
+#
+#
+# Prototype:
+#
+# double fdim(double x, double y)
+#
+
+#
+# Algorithm:
+#
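+# The SSE sequence below forms x - y and then masks it with the result
+# of the x > y comparison.  A rough C equivalent (a sketch only; it does
+# not reproduce the unordered/NaN behaviour of CMPNLESD, and the helper
+# name is illustrative):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#   static double fdim_sketch(double x, double y) {
+#       double d = x - y;                        /* SUBSD            */
+#       uint64_t mask = (x > y) ? ~0ULL : 0ULL;  /* CMPNLESD-style   */
+#       uint64_t bits;
+#       memcpy(&bits, &d, sizeof bits);
+#       bits &= mask;                            /* ANDPD: d or +0.0 */
+#       memcpy(&d, &bits, sizeof d);
+#       return d;
+#   }
+#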
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fdim)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm2
+ SUBSD %xmm1,%xmm0
+ CMPNLESD %xmm1,%xmm2
+ ANDPD %xmm2,%xmm0
+
+ ret
diff --git a/src/gas/fdimf.S b/src/gas/fdimf.S
new file mode 100644
index 0000000..0b7a966
--- /dev/null
+++ b/src/gas/fdimf.S
@@ -0,0 +1,61 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fdimf.S
+#
+# An implementation of the fdimf libm function.
+#
+# The fdim functions determine the positive difference between their arguments
+#
+# x - y if x > y
+# +0 if x <= y
+#
+# Prototype:
+#
+# float fdimf(float x, float y)
+#
+
+#
+# Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fdimf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm2
+ SUBSS %xmm1,%xmm0
+ CMPNLESS %xmm1,%xmm2
+ ANDPS %xmm2,%xmm0
+
+ ret
diff --git a/src/gas/fmax.S b/src/gas/fmax.S
new file mode 100644
index 0000000..ec0d787
--- /dev/null
+++ b/src/gas/fmax.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fmax.S
+#
+# An implementation of the fmax libm function.
+#
+# The fmax functions determine the maximum numeric value of their arguments.
+#
+# Prototype:
+#
+# double fmax(double x, double y)
+#
+
+#
+# Algorithm:
+#
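+# MAXSD returns its source operand (y here) whenever either input is a
+# NaN, so the mask-and-blend below falls back to x when the maximum
+# came out as a NaN.  A hedged C sketch of that selection (the helper
+# name is illustrative):
+#
+#   #include <math.h>
+#   static double fmax_sketch(double x, double y) {
+#       double m = (x > y) ? x : y;   /* MAXSD: y wins if either is NaN */
+#       if (isnan(m)) m = x;          /* CMPEQSD/PAND/PANDN/POR blend   */
+#       return m;
+#   }
+#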
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmax)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm3
+
+ MAXSD %xmm1,%xmm0
+ MOVAPD %xmm0,%xmm2
+
+        #If an input is NaN, special-case to return the other operand
+ CMPEQSD %xmm2,%xmm2
+ PAND %xmm2,%xmm0
+
+ PANDN %xmm3,%xmm2
+ POR %xmm2,%xmm0
+
+ ret
+
diff --git a/src/gas/fmaxf.S b/src/gas/fmaxf.S
new file mode 100644
index 0000000..828832f
--- /dev/null
+++ b/src/gas/fmaxf.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fmaxf.S
+#
+# An implementation of the fmaxf libm function.
+#
+# The fmax functions determine the maximum numeric value of their arguments.
+#
+# Prototype:
+#
+# float fmaxf(float x, float y)
+#
+
+#
+# Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmaxf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm3
+
+ MAXSS %xmm1,%xmm0
+ MOVAPD %xmm0,%xmm2
+
+        #If an input is NaN, special-case to return the other operand
+ CMPEQSS %xmm2,%xmm2
+ PAND %xmm2,%xmm0
+
+ PANDN %xmm3,%xmm2
+ POR %xmm2,%xmm0
+
+ ret
+
diff --git a/src/gas/fmin.S b/src/gas/fmin.S
new file mode 100644
index 0000000..79b3fb6
--- /dev/null
+++ b/src/gas/fmin.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fmin.S
+#
+# An implementation of the fmin libm function.
+#
+# The fmin functions determine the minimum numeric value of their arguments
+#
+# Prototype:
+#
+# double fmin(double x, double y)
+#
+
+#
+# Algorithm:
+#
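+# As in fmax.S, MINSD returns its source operand (y here) whenever
+# either input is a NaN, and the blend below then falls back to x.
+# A hedged C sketch (the helper name is illustrative):
+#
+#   #include <math.h>
+#   static double fmin_sketch(double x, double y) {
+#       double m = (x < y) ? x : y;   /* MINSD: y wins if either is NaN */
+#       if (isnan(m)) m = x;          /* CMPEQSD/PAND/PANDN/POR blend   */
+#       return m;
+#   }
+#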
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmin)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm3
+
+ MINSD %xmm1,%xmm0
+ MOVAPD %xmm0,%xmm2
+
+        #If an input is NaN, special-case to return the other operand
+ CMPEQSD %xmm2,%xmm2
+ PAND %xmm2,%xmm0
+
+ PANDN %xmm3,%xmm2
+ POR %xmm2,%xmm0
+
+ ret
+
diff --git a/src/gas/fminf.S b/src/gas/fminf.S
new file mode 100644
index 0000000..34ee357
--- /dev/null
+++ b/src/gas/fminf.S
@@ -0,0 +1,66 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+        #q = x*x*x*(B1.f64 + x*(B2.f64 + x*(B3.f64 + x*(B4.f64 + x*(B5.f64 + x*(B6.f64 + x*(B7.f64 + x*(B8.f64 + x*(B9.f64)))))))));
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#fminf.S
+#
+# An implementation of the fminf libm function.
+#
+# The fmin functions determine the minimum numeric value of their arguments
+#
+#
+# Prototype:
+#
+# float fminf(float x, float y)
+#
+
+#
+# Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fminf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm3
+
+ MINSS %xmm1,%xmm0
+ MOVAPD %xmm0,%xmm2
+
+        #If an input is NaN, special-case to return the other operand
+ CMPEQSS %xmm2,%xmm2
+ PAND %xmm2,%xmm0
+
+ PANDN %xmm3,%xmm2
+ POR %xmm2,%xmm0
+
+ ret
diff --git a/src/gas/fmod.S b/src/gas/fmod.S
new file mode 100644
index 0000000..bc1eeae
--- /dev/null
+++ b/src/gas/fmod.S
@@ -0,0 +1,223 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fmod.S
+#
+# An implementation of the fmod libm function.
+#
+# Prototype:
+#
+# double fmod(double x,double y);
+#
+
+#
+# Algorithm:
+#
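+# Two paths are used below.  When both exponents are nonzero and differ
+# by less than 52, the remainder is formed directly as |x| - t*|y| with
+# t = trunc(|x|/|y|) and the product t*|y| carried in extra precision;
+# otherwise the x87 fprem loop is used.  A rough C sketch of the fast
+# path (it omits the extra-precision product done in the assembly; the
+# helper name is illustrative):
+#
+#   #include <math.h>
+#   static double fmod_fast_sketch(double x, double y) {
+#       double ax = fabs(x), ay = fabs(y);
+#       double t = trunc(ax / ay);    /* cvttsd2siq / cvtsi2sdq below */
+#       double r = ax - t * ay;       /* remainder of |x| mod |y|     */
+#       if (r < 0.0) r += ay;         /* guard against rounding up    */
+#       return (x < 0.0) ? -r : r;    /* result carries the sign of x */
+#   }
+#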
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmod)
+#define fname_special _fmod_special
+
+
+# local variable storage offsets
+.equ temp_x, 0x0
+.equ temp_y, 0x10
+.equ stack_size, 0x28
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ mov .L__exp_mask_64(%rip), %r10
+ #move the input to GP registers
+ movd %xmm0,%r8
+ movd %xmm1,%r9
+ movapd %xmm0,%xmm4
+ movapd %xmm1,%xmm5
+ movapd .L__Nan_64(%rip),%xmm6
+ and %r10,%r8
+ and %r10,%r9
+ ror $52, %r8
+ ror $52, %r9
+        #if either of the exponents is zero we do the fmod calculation in x87 mode
+ test %r8, %r8
+ jz .L__LargeExpDiffComputation
+ mov %r9,%r10
+ test %r9, %r9
+ jz .L__LargeExpDiffComputation
+ sub %r9,%r8
+ cmp $52,%r8
+ jge .L__LargeExpDiffComputation
+ pand %xmm6,%xmm4
+ pand %xmm6,%xmm5
+ comisd %xmm5,%xmm4
+ jp .L__InputIsNaN # if either of xmm1 or xmm0 is a NaN then
+ # parity flag is set
+ jz .L__Input_Is_Equal
+ jbe .L__ReturnImmediate
+ cmp $0x7FF,%r8
+ jz .L__Dividend_Is_Infinity
+
+ #calculation without using the x87 FPU
+.L__DirectComputation:
+ movapd %xmm4,%xmm2
+ movapd %xmm5,%xmm3
+ divsd %xmm3,%xmm2
+ cvttsd2siq %xmm2,%r8
+ cvtsi2sdq %r8,%xmm2
+
+ #multiplication in QUAD Precision
+ #Since the below commented multiplication resulted in an error
+ #we had to implement a quad precision multiplication.
+ #LOGIC behind Quad Precision Multiplication
+ #x = hx + tx by setting x's last 27 bits to null
+ #y = hy + ty similar to x
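+        #In C terms this block is a Dekker-style exact product (a sketch,
+        #assuming round-to-nearest doubles; bits()/from_bits() stand for
+        #reinterpreting a double as a 64-bit integer and back):
+        #  hx = from_bits(bits(x) & 0xFFFFFFFFF8000000); tx = x - hx;
+        #  hy = from_bits(bits(y) & 0xFFFFFFFFF8000000); ty = y - hy;
+        #  z  = x * y;
+        #  zz = (((hx*hy - z) + hx*ty) + tx*hy) + tx*ty;  /* low part, z + zz ~ x*y exactly */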
+ movapd .L__27bit_andingmask_64(%rip),%xmm4
+ #movddup %xmm5,%xmm5 #[x,x]
+ #movddup %xmm2,%xmm2 #[y,y]
+
+ movapd %xmm5,%xmm1 # x
+ movapd %xmm2,%xmm6 # y
+ movapd %xmm2,%xmm7 #
+ mulsd %xmm5,%xmm7 # xmm7 = z = x*y
+ andpd %xmm4,%xmm1
+ andpd %xmm4,%xmm2
+ subsd %xmm1,%xmm5 # xmm1 = hx xmm5 = tx
+ subsd %xmm2,%xmm6 # xmm2 = hy xmm6 = ty
+
+ movapd %xmm1,%xmm4 # copy hx
+ mulsd %xmm2,%xmm4 # xmm4 = hx*hy
+ subsd %xmm7,%xmm4 # xmm4 = (hx*hy - z)
+ mulsd %xmm6,%xmm1 # xmm1 = hx * ty
+ addsd %xmm1,%xmm4 # xmm4 = ((hx * hy - *z) + hx * ty)
+ mulsd %xmm5,%xmm2 # xmm2 = tx * hy
+ addsd %xmm2,%xmm4 # xmm4 = (((hx * hy - *z) + hx * ty) + tx * hy)
+ mulsd %xmm5,%xmm6 # xmm6 = tx * ty
+ addsd %xmm4,%xmm6 # xmm6 = (((hx * hy - *z) + hx * ty) + tx * hy) + tx * ty;
+ #xmm6 and xmm7 contain the quad precision result
+ #v = dx - c;
+ #dx = v + (((dx - v) - c) - cc);
+ movapd %xmm0,%xmm1 # copy the input number
+ pand .L__Nan_64(%rip),%xmm1
+ movapd %xmm1,%xmm2 # xmm2 = dx = xmm1
+ subsd %xmm7,%xmm1 # v = dx - c
+ subsd %xmm1,%xmm2 # (dx - v)
+ subsd %xmm7,%xmm2 # ((dx - v) - c)
+ subsd %xmm6,%xmm2 # (((dx - v) - c) - cc)
+ addsd %xmm1,%xmm2 # xmm2 = dx = v + (((dx - v) - c) - cc)
+ # xmm3 = w
+ comisd .L__Zero_64(%rip),%xmm2
+ jae .L__positive
+ addsd %xmm3,%xmm2
+.L__positive:
+# return x < 0.0? -dx : dx;
+.L__Finish:
+ comisd .L__Zero_64(%rip), %xmm0
+ ja .L__Not_Negative_Number1
+
+.L__Negative_Number1:
+ movapd .L__Zero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ ret
+.L__Not_Negative_Number1:
+ movapd %xmm2,%xmm0
+ ret
+
+ #calculation using the x87 FPU
+        #For numbers where the exponent of either the divisor
+        #or the dividend is 0, or where the exponent
+        #difference is greater than 52.
+.align 16
+.L__LargeExpDiffComputation:
+ sub $stack_size, %rsp
+ movsd %xmm0, temp_x(%rsp)
+ movsd %xmm1, temp_y(%rsp)
+ ffree %st(0)
+ ffree %st(1)
+ fldl temp_y(%rsp)
+ fldl temp_x(%rsp)
+ fnclex
+.align 32
+.L__repeat:
+ fprem #Calculate remainder by dividing st(0) with st(1)
+ #fprem operation sets x87 condition codes,
+ #it will set the C2 code to 1 if a partial remainder is calculated
+ fnstsw %ax
+        and     $0x0400,%ax     # fnstsw above stored the x87 status word in %ax;
+                                # keep only bit 10 (C2) of the condition codes
+ cmp $0x0400,%ax # Checks whether the bit 10(C2) is set or not
+ # IF its set then a partial remainder was calculated
+ jz .L__repeat
+ #store the result from the FPU stack to memory
+ fstpl temp_x(%rsp)
+ fstpl temp_y(%rsp)
+ movsd temp_x(%rsp), %xmm0
+ add $stack_size, %rsp
+ ret
+
+ #IF both the inputs are equal
+.L__Input_Is_Equal:
+ cmp $0x7FF,%r8
+ jz .L__Dividend_Is_Infinity
+ cmp $0x7FF,%r9
+ jz .L__InputIsNaN
+ movsd %xmm0,%xmm1
+ pand .L__sign_mask_64(%rip),%xmm1
+ movsd .L__Zero_64(%rip),%xmm0
+ por %xmm1,%xmm0
+ ret
+
+.L__InputIsNaN:
+ por .L__QNaN_mask_64(%rip),%xmm0
+ por .L__exp_mask_64(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+ ret
+
+#Case when x < y
+.L__ReturnImmediate:
+ ret
+
+
+
+.align 32
+.L__sign_mask_64: .quad 0x8000000000000000
+ .quad 0x0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__Zero_64: .quad 0x0
+ .quad 0
+.L__QNaN_mask_64: .quad 0x0008000000000000
+ .quad 0
+.L__Nan_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+
diff --git a/src/gas/fmodf.S b/src/gas/fmodf.S
new file mode 100644
index 0000000..c31d619
--- /dev/null
+++ b/src/gas/fmodf.S
@@ -0,0 +1,181 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# fmodf.S
+#
+# An implementation of the fmodf libm function.
+#
+# Prototype:
+#
+# float fmodf(float x,float y);
+#
+
+#
+# Algorithm:
+#
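+# The float inputs are widened to double and |x| is reduced against a
+# scaled copy of |y|: w starts at 2^(24*ntimes) * |y| and is divided by
+# 2^24 on every pass, so each pass removes up to 24 bits of quotient.
+# A hedged C sketch of a single reduction pass (the helper name is
+# illustrative):
+#
+#   #include <math.h>
+#   static double reduce_pass_sketch(double dx, double *w) {
+#       double t = trunc(dx / *w);  /* cvttsd2siq / cvtsi2sdq below    */
+#       dx -= t * *w;               /* strip that part of the quotient */
+#       *w *= 0x1p-24;              /* scale the divisor down by 2^24  */
+#       return dx;
+#   }
+#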
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(fmodf)
+#define fname_special _fmodf_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ mov .L__exp_mask_64(%rip), %rdi
+ movapd .L__sign_mask_64(%rip),%xmm6
+ cvtss2sd %xmm0,%xmm2 # double x
+ cvtss2sd %xmm1,%xmm3 # double y
+ pand %xmm6,%xmm2
+ pand %xmm6,%xmm3
+ movd %xmm2,%rax
+ movd %xmm3,%r8
+ mov %rax,%r11
+ mov %r8,%r9
+ movsd %xmm2,%xmm4
+ #take the exponents of both x and y
+ and %rdi,%rax
+ and %rdi,%r8
+ ror $52, %rax
+ ror $52, %r8
+        # if either of the exponents is all ones (NaN or infinity)
+ cmp $0X7FF,%rax
+ jz .L__InputIsNaN
+ cmp $0X7FF,%r8
+ jz .L__InputIsNaNOrInf
+
+ cmp $0,%r8
+ jz .L__Divisor_Is_Zero
+
+ cmp %r9, %r11
+ jz .L__Input_Is_Equal
+ jb .L__ReturnImmediate
+
+ xor %rcx,%rcx
+ mov $24,%rdx
+ movsd .L__One_64(%rip),%xmm7 # xmm7 = scale
+ cmp %rax,%r8
+ jae .L__y_is_greater
+ #xmm3 = dy
+ sub %r8,%rax
+ div %dl # al = ntimes
+ mov %al,%cl # cl = ntimes
+        and     $0xFF,%ax       # set everything to zero except al
+ mul %dl # ax = dl * al = 24* ntimes
+ add $1023, %rax
+ shl $52,%rax
+ movd %rax,%xmm7 # xmm7 = scale
+.L__y_is_greater:
+ mulsd %xmm3,%xmm7 # xmm7 = scale * dy
+ movsd .L__2pminus24_decimal(%rip),%xmm6
+
+.align 16
+.L__Start_Loop:
+ dec %cl
+ js .L__End_Loop
+ divsd %xmm7,%xmm4 # xmm7 = (dx / w)
+ cvttsd2siq %xmm4,%rax
+ cvtsi2sdq %rax,%xmm4 # xmm4 = t = (double)((int)(dx / w))
+ mulsd %xmm7,%xmm4 # xmm4 = w*t
+ mulsd %xmm6,%xmm7 # w*= scale
+ subsd %xmm4,%xmm2 # xmm2 = dx -= w*t
+ movsd %xmm2,%xmm4 # xmm4 = dx
+ jmp .L__Start_Loop
+.L__End_Loop:
+ divsd %xmm7,%xmm4 # xmm7 = (dx / w)
+ cvttsd2siq %xmm4,%rax
+ cvtsi2sdq %rax,%xmm4 # xmm4 = t = (double)((int)(dx / w))
+ mulsd %xmm7,%xmm4 # xmm4 = w*t
+ subsd %xmm4,%xmm2 # xmm2 = dx -= w*t
+ comiss .L__Zero_64(%rip),%xmm0
+ jb .L__Negative
+.L__Positive:
+ cvtsd2ss %xmm2,%xmm0
+ ret
+.L__Negative:
+ movsd .L__MinusZero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ cvtsd2ss %xmm0,%xmm0
+ ret
+
+.align 16
+.L__Input_Is_Equal:
+ cmp $0x7FF,%rax
+ jz .L__Dividend_Is_Infinity
+ cmp $0x7FF,%r8
+ jz .L__InputIsNaNOrInf
+ movsd %xmm0,%xmm1
+ pand .L__sign_bit_32(%rip),%xmm1
+ movss .L__Zero_64(%rip),%xmm0
+ por %xmm1,%xmm0
+ ret
+
+.L__InputIsNaNOrInf:
+ comiss %xmm0,%xmm1
+ jp .L__InputIsNaN
+ ret
+.L__Divisor_Is_Zero:
+.L__InputIsNaN:
+ por .L__exp_mask_32(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+ por .L__QNaN_mask_32(%rip),%xmm0
+ ret
+
+#Case when x < y
+.L__ReturnImmediate:
+ #xmm0 contains the input and is the result
+ ret
+
+
+
+.align 32
+.L__sign_bit_32: .quad 0x8000000080000000
+ .quad 0x0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0x0
+.L__exp_mask_32: .quad 0x000000007F800000
+ .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__One_64: .quad 0x3FF0000000000000
+ .quad 0
+.L__Zero_64: .quad 0x0
+ .quad 0
+.L__MinusZero_64: .quad 0x8000000000000000
+ .quad 0
+.L__QNaN_mask_32: .quad 0x0000000000400000
+ .quad 0
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__2pminus24_decimal: .quad 0x3E70000000000000
+ .quad 0
+
diff --git a/src/gas/log.S b/src/gas/log.S
new file mode 100644
index 0000000..7068c6d
--- /dev/null
+++ b/src/gas/log.S
@@ -0,0 +1,1155 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log.S
+#
+# An implementation of the log libm function.
+#
+# Prototype:
+#
+# double log(double x);
+#
+
+#
+# Algorithm:
+#
+# Based on:
+# Ping-Tak Peter Tang
+# "Table-driven implementation of the logarithm function in IEEE
+# floating-point arithmetic"
+# ACM Transactions on Mathematical Software (TOMS)
+# Volume 16, Issue 4 (December 1990)
+#
+#
+# x very close to 1.0 is handled differently, for x everywhere else
+# a brief explanation is given below
+#
+# x = (2^m)*A
+# x = (2^m)*(G+g) with (1 <= G < 2) and (g <= 2^(-9))
+# x = (2^m)*2*(G/2+g/2)
+# x = (2^m)*2*(F+f) with (0.5 <= F < 1) and (f <= 2^(-10))
+#
+# Y = (2^(-1))*(2^(-m))*(2^m)*A
+# Now, range of Y is: 0.5 <= Y < 1
+#
+# F = 0x100 + (first 8 mantissa bits) + (9th mantissa bit)
+# Now, range of F is: 256 <= F <= 512
+# F = F / 512
+# Now, range of F is: 0.5 <= F <= 1
+#
+# f = -(Y-F), with (f <= 2^(-10))
+#
+# log(x) = m*log(2) + log(2) + log(F-f)
+# log(x) = m*log(2) + log(2) + log(F) + log(1-(f/F))
+# log(x) = m*log(2) + log(2*F) + log(1-r)
+#
+# r = (f/F), with (r <= 2^(-9))
+# r = f*(1/F) with (1/F) precomputed to avoid division
+#
+# log(x) = m*log(2) + log(G) - poly
+#
+# log(G) is precomputed
+# poly = (r + (r^2)/2 + (r^3)/3 + (r^4)/4) + (r^5)/5) + (r^6)/6))
+#
+# log(2) and log(G) need to be maintained in extra precision
+# to avoid losing precision in the calculations
+#
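+# A condensed C sketch of the main path (a sketch only: the real code
+# reads log(2*F) and 1/F from the tables below and keeps lead/tail
+# parts for extra precision; the helper name is illustrative):
+#
+#   #include <math.h>
+#   static double log_sketch(double x) {
+#       int m;
+#       double y = frexp(x, &m);               /* x = y * 2^m, y in [0.5,1)  */
+#       double F = (double)(int)(y * 512.0 + 0.5) / 512.0;  /* 9-bit rounding */
+#       double r = (F - y) / F;                /* r = f/F, |r| <= ~2^-9       */
+#       double poly = r + r*r/2 + r*r*r/3;     /* first terms of the series   */
+#       return m * 0.6931471805599453 + log(F) - poly;
+#   }
+#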
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log)
+#define fname_special _log_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %rax, %rax
+ movdqa %xmm0, %xmm3
+ movsd %xmm0, %xmm4
+ psrlq $52, %xmm3
+ movd %xmm0, %rax
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm5
+ andpd .L__real_inf(%rip), %xmm5
+ comisd .L__real_inf(%rip), %xmm5
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorpd %xmm5, %xmm5
+ comisd %xmm5, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm4
+
+ comisd .L__mask_1023_f(%rip), %xmm6
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %rax, %r9
+ and .L__mask_mant_all8(%rip), %rax
+ and .L__mask_mant9(%rip), %r9
+ shl $1, %r9
+ add %r9, %rax
+ mov %rax, p_temp(%rsp)
+
+ # near one codepath
+ andpd .L__real_notsign(%rip), %xmm4
+ comisd .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movsd p_temp(%rsp), %xmm1
+ shr $44, %rax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ mulsd (%r9,%rax,8), %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_6(%rip), %xmm3
+ movsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_5(%rip), %xmm3
+ addsd .L__real_1_over_2(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_over_4(%rip), %xmm3
+ addsd %xmm2, %xmm1
+ mulsd %xmm4, %xmm3
+ addsd %xmm3, %xmm1
+
+ # m*log(2) + log(G) - poly
+ movsd .L__real_log2_tail(%rip), %xmm5
+ mulsd %xmm6, %xmm5
+ subsd %xmm1, %xmm5
+
+ movsd (%r9,%rax,8), %xmm0
+ lea .L__log_256_tail(%rip), %rdx
+ movsd (%rdx,%rax,8), %xmm2
+ addsd %xmm5, %xmm2
+
+ movsd .L__real_log2_lead(%rip), %xmm4
+ mulsd %xmm6, %xmm4
+ addsd %xmm4, %xmm0
+
+ addsd %xmm2, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
+
+ # r = x - 1.0
+ movsd .L__real_two(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm0 # r
+
+ addsd %xmm0, %xmm2
+ movsd %xmm0, %xmm1
+ divsd %xmm2, %xmm1 # r/(2+r) = u/2
+
+ movsd .L__real_ca2(%rip), %xmm4
+ movsd .L__real_ca4(%rip), %xmm5
+
+ movsd %xmm0, %xmm6
+ mulsd %xmm1, %xmm6 # correction
+
+ addsd %xmm1, %xmm1 # u
+ movsd %xmm1, %xmm2
+
+ mulsd %xmm1, %xmm2 # u^2
+
+ mulsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm5
+
+ addsd .L__real_ca1(%rip), %xmm4
+ addsd .L__real_ca3(%rip), %xmm5
+
+ mulsd %xmm1, %xmm2 # u^3
+ mulsd %xmm2, %xmm4
+
+ mulsd %xmm2, %xmm2
+ mulsd %xmm1, %xmm2 # u^7
+ mulsd %xmm2, %xmm5
+
+ addsd %xmm5, %xmm4
+ subsd %xmm6, %xmm4
+ addsd %xmm4, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm2
+ movsd %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %rax
+ psrlq $52, %xmm5
+ psubd .L__mask_2045(%rip), %xmm5
+ cvtdq2pd %xmm5, %xmm6
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movsd .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movsd .L__real_qnan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %rax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %rax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9
+ and %rax, %r9
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %rax
+ movd %rax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_qnan: .quad 0x7ff8000000000000 # qNaN
+ .quad 0x0000000000000000
+.L__real_qnanbit: .quad 0x0008000000000000
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+.L__mask_001: .quad 0x0000000000000001
+ .quad 0x0000000000000000
+
+.L__mask_mant_all8: .quad 0x000ff00000000000
+ .quad 0x0000000000000000
+.L__mask_mant9: .quad 0x0000080000000000
+ .quad 0x0000000000000000
+
+.L__real_log2_lead: .quad 0x3fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x0000000000000000
+.L__real_log2_tail: .quad 0x3e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x0000000000000000
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__mask_100: .quad 0x0000000000000100
+ .quad 0x0000000000000000
+
+.L__real_1_over_512: .quad 0x3f60000000000000
+ .quad 0x0000000000000000
+
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_5: .quad 0x3fc999999999999a
+ .quad 0x0000000000000000
+.L__real_1_over_6: .quad 0x3fc5555555555555
+ .quad 0x0000000000000000
+
+.L__mask_1023_f: .quad 0x0c08ff80000000000
+ .quad 0x0000000000000000
+
+.L__mask_2045: .quad 0x00000000000007fd
+ .quad 0x0000000000000000
+
+.L__real_threshold: .quad 0x3fb0000000000000 # .0625
+ .quad 0x0000000000000000
+
+.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x0000000000000000
+
+.L__real_ca1: .quad 0x3fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x0000000000000000
+.L__real_ca2: .quad 0x3f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x0000000000000000
+.L__real_ca3: .quad 0x3f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x0000000000000000
+.L__real_ca4: .quad 0x3f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_lead:
+ .quad 0x0000000000000000
+ .quad 0x3f6ff00aa0000000
+ .quad 0x3f7fe02a60000000
+ .quad 0x3f87dc4750000000
+ .quad 0x3f8fc0a8b0000000
+ .quad 0x3f93cea440000000
+ .quad 0x3f97b91b00000000
+ .quad 0x3f9b9fc020000000
+ .quad 0x3f9f829b00000000
+ .quad 0x3fa1b0d980000000
+ .quad 0x3fa39e87b0000000
+ .quad 0x3fa58a5ba0000000
+ .quad 0x3fa77458f0000000
+ .quad 0x3fa95c8300000000
+ .quad 0x3fab42dd70000000
+ .quad 0x3fad276b80000000
+ .quad 0x3faf0a30c0000000
+ .quad 0x3fb0759830000000
+ .quad 0x3fb16536e0000000
+ .quad 0x3fb253f620000000
+ .quad 0x3fb341d790000000
+ .quad 0x3fb42edcb0000000
+ .quad 0x3fb51b0730000000
+ .quad 0x3fb60658a0000000
+ .quad 0x3fb6f0d280000000
+ .quad 0x3fb7da7660000000
+ .quad 0x3fb8c345d0000000
+ .quad 0x3fb9ab4240000000
+ .quad 0x3fba926d30000000
+ .quad 0x3fbb78c820000000
+ .quad 0x3fbc5e5480000000
+ .quad 0x3fbd4313d0000000
+ .quad 0x3fbe270760000000
+ .quad 0x3fbf0a30c0000000
+ .quad 0x3fbfec9130000000
+ .quad 0x3fc0671510000000
+ .quad 0x3fc0d77e70000000
+ .quad 0x3fc1478580000000
+ .quad 0x3fc1b72ad0000000
+ .quad 0x3fc2266f10000000
+ .quad 0x3fc29552f0000000
+ .quad 0x3fc303d710000000
+ .quad 0x3fc371fc20000000
+ .quad 0x3fc3dfc2b0000000
+ .quad 0x3fc44d2b60000000
+ .quad 0x3fc4ba36f0000000
+ .quad 0x3fc526e5e0000000
+ .quad 0x3fc59338d0000000
+ .quad 0x3fc5ff3070000000
+ .quad 0x3fc66acd40000000
+ .quad 0x3fc6d60fe0000000
+ .quad 0x3fc740f8f0000000
+ .quad 0x3fc7ab8900000000
+ .quad 0x3fc815c0a0000000
+ .quad 0x3fc87fa060000000
+ .quad 0x3fc8e928d0000000
+ .quad 0x3fc9525a90000000
+ .quad 0x3fc9bb3620000000
+ .quad 0x3fca23bc10000000
+ .quad 0x3fca8becf0000000
+ .quad 0x3fcaf3c940000000
+ .quad 0x3fcb5b5190000000
+ .quad 0x3fcbc28670000000
+ .quad 0x3fcc296850000000
+ .quad 0x3fcc8ff7c0000000
+ .quad 0x3fccf63540000000
+ .quad 0x3fcd5c2160000000
+ .quad 0x3fcdc1bca0000000
+ .quad 0x3fce270760000000
+ .quad 0x3fce8c0250000000
+ .quad 0x3fcef0adc0000000
+ .quad 0x3fcf550a50000000
+ .quad 0x3fcfb91860000000
+ .quad 0x3fd00e6c40000000
+ .quad 0x3fd0402590000000
+ .quad 0x3fd071b850000000
+ .quad 0x3fd0a324e0000000
+ .quad 0x3fd0d46b50000000
+ .quad 0x3fd1058bf0000000
+ .quad 0x3fd1368700000000
+ .quad 0x3fd1675ca0000000
+ .quad 0x3fd1980d20000000
+ .quad 0x3fd1c898c0000000
+ .quad 0x3fd1f8ff90000000
+ .quad 0x3fd22941f0000000
+ .quad 0x3fd2596010000000
+ .quad 0x3fd2895a10000000
+ .quad 0x3fd2b93030000000
+ .quad 0x3fd2e8e2b0000000
+ .quad 0x3fd31871c0000000
+ .quad 0x3fd347dd90000000
+ .quad 0x3fd3772660000000
+ .quad 0x3fd3a64c50000000
+ .quad 0x3fd3d54fa0000000
+ .quad 0x3fd4043080000000
+ .quad 0x3fd432ef20000000
+ .quad 0x3fd4618bc0000000
+ .quad 0x3fd4900680000000
+ .quad 0x3fd4be5f90000000
+ .quad 0x3fd4ec9730000000
+ .quad 0x3fd51aad80000000
+ .quad 0x3fd548a2c0000000
+ .quad 0x3fd5767710000000
+ .quad 0x3fd5a42ab0000000
+ .quad 0x3fd5d1bdb0000000
+ .quad 0x3fd5ff3070000000
+ .quad 0x3fd62c82f0000000
+ .quad 0x3fd659b570000000
+ .quad 0x3fd686c810000000
+ .quad 0x3fd6b3bb20000000
+ .quad 0x3fd6e08ea0000000
+ .quad 0x3fd70d42e0000000
+ .quad 0x3fd739d7f0000000
+ .quad 0x3fd7664e10000000
+ .quad 0x3fd792a550000000
+ .quad 0x3fd7bede00000000
+ .quad 0x3fd7eaf830000000
+ .quad 0x3fd816f410000000
+ .quad 0x3fd842d1d0000000
+ .quad 0x3fd86e9190000000
+ .quad 0x3fd89a3380000000
+ .quad 0x3fd8c5b7c0000000
+ .quad 0x3fd8f11e80000000
+ .quad 0x3fd91c67e0000000
+ .quad 0x3fd9479410000000
+ .quad 0x3fd972a340000000
+ .quad 0x3fd99d9580000000
+ .quad 0x3fd9c86b00000000
+ .quad 0x3fd9f323e0000000
+ .quad 0x3fda1dc060000000
+ .quad 0x3fda484090000000
+ .quad 0x3fda72a490000000
+ .quad 0x3fda9cec90000000
+ .quad 0x3fdac718c0000000
+ .quad 0x3fdaf12930000000
+ .quad 0x3fdb1b1e00000000
+ .quad 0x3fdb44f770000000
+ .quad 0x3fdb6eb590000000
+ .quad 0x3fdb985890000000
+ .quad 0x3fdbc1e080000000
+ .quad 0x3fdbeb4d90000000
+ .quad 0x3fdc149ff0000000
+ .quad 0x3fdc3dd7a0000000
+ .quad 0x3fdc66f4e0000000
+ .quad 0x3fdc8ff7c0000000
+ .quad 0x3fdcb8e070000000
+ .quad 0x3fdce1af00000000
+ .quad 0x3fdd0a63a0000000
+ .quad 0x3fdd32fe70000000
+ .quad 0x3fdd5b7f90000000
+ .quad 0x3fdd83e720000000
+ .quad 0x3fddac3530000000
+ .quad 0x3fddd46a00000000
+ .quad 0x3fddfc8590000000
+ .quad 0x3fde248810000000
+ .quad 0x3fde4c71a0000000
+ .quad 0x3fde744260000000
+ .quad 0x3fde9bfa60000000
+ .quad 0x3fdec399d0000000
+ .quad 0x3fdeeb20c0000000
+ .quad 0x3fdf128f50000000
+ .quad 0x3fdf39e5b0000000
+ .quad 0x3fdf6123f0000000
+ .quad 0x3fdf884a30000000
+ .quad 0x3fdfaf5880000000
+ .quad 0x3fdfd64f20000000
+ .quad 0x3fdffd2e00000000
+ .quad 0x3fe011fab0000000
+ .quad 0x3fe02552a0000000
+ .quad 0x3fe0389ee0000000
+ .quad 0x3fe04bdf90000000
+ .quad 0x3fe05f14b0000000
+ .quad 0x3fe0723e50000000
+ .quad 0x3fe0855c80000000
+ .quad 0x3fe0986f40000000
+ .quad 0x3fe0ab76b0000000
+ .quad 0x3fe0be72e0000000
+ .quad 0x3fe0d163c0000000
+ .quad 0x3fe0e44980000000
+ .quad 0x3fe0f72410000000
+ .quad 0x3fe109f390000000
+ .quad 0x3fe11cb810000000
+ .quad 0x3fe12f7190000000
+ .quad 0x3fe1422020000000
+ .quad 0x3fe154c3d0000000
+ .quad 0x3fe1675ca0000000
+ .quad 0x3fe179eab0000000
+ .quad 0x3fe18c6e00000000
+ .quad 0x3fe19ee6b0000000
+ .quad 0x3fe1b154b0000000
+ .quad 0x3fe1c3b810000000
+ .quad 0x3fe1d610f0000000
+ .quad 0x3fe1e85f50000000
+ .quad 0x3fe1faa340000000
+ .quad 0x3fe20cdcd0000000
+ .quad 0x3fe21f0bf0000000
+ .quad 0x3fe23130d0000000
+ .quad 0x3fe2434b60000000
+ .quad 0x3fe2555bc0000000
+ .quad 0x3fe2676200000000
+ .quad 0x3fe2795e10000000
+ .quad 0x3fe28b5000000000
+ .quad 0x3fe29d37f0000000
+ .quad 0x3fe2af15f0000000
+ .quad 0x3fe2c0e9e0000000
+ .quad 0x3fe2d2b400000000
+ .quad 0x3fe2e47430000000
+ .quad 0x3fe2f62a90000000
+ .quad 0x3fe307d730000000
+ .quad 0x3fe3197a00000000
+ .quad 0x3fe32b1330000000
+ .quad 0x3fe33ca2b0000000
+ .quad 0x3fe34e2890000000
+ .quad 0x3fe35fa4e0000000
+ .quad 0x3fe37117b0000000
+ .quad 0x3fe38280f0000000
+ .quad 0x3fe393e0d0000000
+ .quad 0x3fe3a53730000000
+ .quad 0x3fe3b68440000000
+ .quad 0x3fe3c7c7f0000000
+ .quad 0x3fe3d90260000000
+ .quad 0x3fe3ea3390000000
+ .quad 0x3fe3fb5b80000000
+ .quad 0x3fe40c7a40000000
+ .quad 0x3fe41d8fe0000000
+ .quad 0x3fe42e9c60000000
+ .quad 0x3fe43f9fe0000000
+ .quad 0x3fe4509a50000000
+ .quad 0x3fe4618bc0000000
+ .quad 0x3fe4727430000000
+ .quad 0x3fe48353d0000000
+ .quad 0x3fe4942a80000000
+ .quad 0x3fe4a4f850000000
+ .quad 0x3fe4b5bd60000000
+ .quad 0x3fe4c679a0000000
+ .quad 0x3fe4d72d30000000
+ .quad 0x3fe4e7d810000000
+ .quad 0x3fe4f87a30000000
+ .quad 0x3fe50913c0000000
+ .quad 0x3fe519a4c0000000
+ .quad 0x3fe52a2d20000000
+ .quad 0x3fe53aad00000000
+ .quad 0x3fe54b2460000000
+ .quad 0x3fe55b9350000000
+ .quad 0x3fe56bf9d0000000
+ .quad 0x3fe57c57f0000000
+ .quad 0x3fe58cadb0000000
+ .quad 0x3fe59cfb20000000
+ .quad 0x3fe5ad4040000000
+ .quad 0x3fe5bd7d30000000
+ .quad 0x3fe5cdb1d0000000
+ .quad 0x3fe5ddde50000000
+ .quad 0x3fe5ee02a0000000
+ .quad 0x3fe5fe1ed0000000
+ .quad 0x3fe60e32f0000000
+ .quad 0x3fe61e3ef0000000
+ .quad 0x3fe62e42e0000000
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_tail:
+ .quad 0x0000000000000000
+ .quad 0x3db5885e0250435a
+ .quad 0x3de620cf11f86ed2
+ .quad 0x3dff0214edba4a25
+ .quad 0x3dbf807c79f3db4e
+ .quad 0x3dea352ba779a52b
+ .quad 0x3dff56c46aa49fd5
+ .quad 0x3dfebe465fef5196
+ .quad 0x3e0cf0660099f1f8
+ .quad 0x3e1247b2ff85945d
+ .quad 0x3e13fd7abf5202b6
+ .quad 0x3e1f91c9a918d51e
+ .quad 0x3e08cb73f118d3ca
+ .quad 0x3e1d91c7d6fad074
+ .quad 0x3de1971bec28d14c
+ .quad 0x3e15b616a423c78a
+ .quad 0x3da162a6617cc971
+ .quad 0x3e166391c4c06d29
+ .quad 0x3e2d46f5c1d0c4b8
+ .quad 0x3e2e14282df1f6d3
+ .quad 0x3e186f47424a660d
+ .quad 0x3e2d4c8de077753e
+ .quad 0x3e2e0c307ed24f1c
+ .quad 0x3e226ea18763bdd3
+ .quad 0x3e25cad69737c933
+ .quad 0x3e2af62599088901
+ .quad 0x3e18c66c83d6b2d0
+ .quad 0x3e1880ceb36fb30f
+ .quad 0x3e2495aac6ca17a4
+ .quad 0x3e2761db4210878c
+ .quad 0x3e2eb78e862bac2f
+ .quad 0x3e19b2cd75790dd9
+ .quad 0x3e2c55e5cbd3d50f
+ .quad 0x3db162a6617cc971
+ .quad 0x3dfdbeabaaa2e519
+ .quad 0x3e1652cb7150c647
+ .quad 0x3e39a11cb2cd2ee2
+ .quad 0x3e219d0ab1a28813
+ .quad 0x3e24bd9e80a41811
+ .quad 0x3e3214b596faa3df
+ .quad 0x3e303fea46980bb8
+ .quad 0x3e31c8ffa5fd28c7
+ .quad 0x3dce8f743bcd96c5
+ .quad 0x3dfd98c5395315c6
+ .quad 0x3e3996fa3ccfa7b2
+ .quad 0x3e1cd2af2ad13037
+ .quad 0x3e1d0da1bd17200e
+ .quad 0x3e3330410ba68b75
+ .quad 0x3df4f27a790e7c41
+ .quad 0x3e13956a86f6ff1b
+ .quad 0x3e2c6748723551d9
+ .quad 0x3e2500de9326cdfc
+ .quad 0x3e1086c848df1b59
+ .quad 0x3e04357ead6836ff
+ .quad 0x3e24832442408024
+ .quad 0x3e3d10da8154b13d
+ .quad 0x3e39e8ad68ec8260
+ .quad 0x3e3cfbf706abaf18
+ .quad 0x3e3fc56ac6326e23
+ .quad 0x3e39105e3185cf21
+ .quad 0x3e3d017fe5b19cc0
+ .quad 0x3e3d1f6b48dd13fe
+ .quad 0x3e20b63358a7e73a
+ .quad 0x3e263063028c211c
+ .quad 0x3e2e6a6886b09760
+ .quad 0x3e3c138bb891cd03
+ .quad 0x3e369f7722b7221a
+ .quad 0x3df57d8fac1a628c
+ .quad 0x3e3c55e5cbd3d50f
+ .quad 0x3e1552d2ff48fe2e
+ .quad 0x3e37b8b26ca431bc
+ .quad 0x3e292decdc1c5f6d
+ .quad 0x3e3abc7c551aaa8c
+ .quad 0x3e36b540731a354b
+ .quad 0x3e32d341036b89ef
+ .quad 0x3e4f9ab21a3a2e0f
+ .quad 0x3e239c871afb9fbd
+ .quad 0x3e3e6add2c81f640
+ .quad 0x3e435c95aa313f41
+ .quad 0x3e249d4582f6cc53
+ .quad 0x3e47574c1c07398f
+ .quad 0x3e4ba846dece9e8d
+ .quad 0x3e16999fafbc68e7
+ .quad 0x3e4c9145e51b0103
+ .quad 0x3e479ef2cb44850a
+ .quad 0x3e0beec73de11275
+ .quad 0x3e2ef4351af5a498
+ .quad 0x3e45713a493b4a50
+ .quad 0x3e45c23a61385992
+ .quad 0x3e42a88309f57299
+ .quad 0x3e4530faa9ac8ace
+ .quad 0x3e25fec2d792a758
+ .quad 0x3e35a517a71cbcd7
+ .quad 0x3e3707dc3e1cd9a3
+ .quad 0x3e3a1a9f8ef43049
+ .quad 0x3e4409d0276b3674
+ .quad 0x3e20e2f613e85bd9
+ .quad 0x3df0027433001e5f
+ .quad 0x3e35dde2836d3265
+ .quad 0x3e2300134d7aaf04
+ .quad 0x3e3cb7e0b42724f5
+ .quad 0x3e2d6e93167e6308
+ .quad 0x3e3d1569b1526adb
+ .quad 0x3e0e99fc338a1a41
+ .quad 0x3e4eb01394a11b1c
+ .quad 0x3e04f27a790e7c41
+ .quad 0x3e25ce3ca97b7af9
+ .quad 0x3e281f0f940ed857
+ .quad 0x3e4d36295d88857c
+ .quad 0x3e21aca1ec4af526
+ .quad 0x3e445743c7182726
+ .quad 0x3e23c491aead337e
+ .quad 0x3e3aef401a738931
+ .quad 0x3e21cede76092a29
+ .quad 0x3e4fba8f44f82bb4
+ .quad 0x3e446f5f7f3c3e1a
+ .quad 0x3e47055f86c9674b
+ .quad 0x3e4b41a92b6b6e1a
+ .quad 0x3e443d162e927628
+ .quad 0x3e4466174013f9b1
+ .quad 0x3e3b05096ad69c62
+ .quad 0x3e40b169150faa58
+ .quad 0x3e3cd98b1df85da7
+ .quad 0x3e468b507b0f8fa8
+ .quad 0x3e48422df57499ba
+ .quad 0x3e11351586970274
+ .quad 0x3e117e08acba92ee
+ .quad 0x3e26e04314dd0229
+ .quad 0x3e497f3097e56d1a
+ .quad 0x3e3356e655901286
+ .quad 0x3e0cb761457f94d6
+ .quad 0x3e39af67a85a9dac
+ .quad 0x3e453410931a909f
+ .quad 0x3e22c587206058f5
+ .quad 0x3e223bc358899c22
+ .quad 0x3e4d7bf8b6d223cb
+ .quad 0x3e47991ec5197ddb
+ .quad 0x3e4a79e6bb3a9219
+ .quad 0x3e3a4c43ed663ec5
+ .quad 0x3e461b5a1484f438
+ .quad 0x3e4b4e36f7ef0c3a
+ .quad 0x3e115f026acd0d1b
+ .quad 0x3e3f36b535cecf05
+ .quad 0x3e2ffb7fbf3eb5c6
+ .quad 0x3e3e6a6886b09760
+ .quad 0x3e3135eb27f5bbc3
+ .quad 0x3e470be7d6f6fa57
+ .quad 0x3e4ce43cc84ab338
+ .quad 0x3e4c01d7aac3bd91
+ .quad 0x3e45c58d07961060
+ .quad 0x3e3628bcf941456e
+ .quad 0x3e4c58b2a8461cd2
+ .quad 0x3e33071282fb989a
+ .quad 0x3e420dab6a80f09c
+ .quad 0x3e44f8d84c397b1e
+ .quad 0x3e40d0ee08599e48
+ .quad 0x3e1d68787e37da36
+ .quad 0x3e366187d591bafc
+ .quad 0x3e22346600bae772
+ .quad 0x3e390377d0d61b8e
+ .quad 0x3e4f5e0dd966b907
+ .quad 0x3e49023cb79a00e2
+ .quad 0x3e44e05158c28ad8
+ .quad 0x3e3bfa7b08b18ae4
+ .quad 0x3e4ef1e63db35f67
+ .quad 0x3e0ec2ae39493d4f
+ .quad 0x3e40afe930ab2fa0
+ .quad 0x3e225ff8a1810dd4
+ .quad 0x3e469743fb1a71a5
+ .quad 0x3e5f9cc676785571
+ .quad 0x3e5b524da4cbf982
+ .quad 0x3e5a4c8b381535b8
+ .quad 0x3e5839be809caf2c
+ .quad 0x3e50968a1cb82c13
+ .quad 0x3e5eae6a41723fb5
+ .quad 0x3e5d9c29a380a4db
+ .quad 0x3e4094aa0ada625e
+ .quad 0x3e5973ad6fc108ca
+ .quad 0x3e4747322fdbab97
+ .quad 0x3e593692fa9d4221
+ .quad 0x3e5c5a992dfbc7d9
+ .quad 0x3e4e1f33e102387a
+ .quad 0x3e464fbef14c048c
+ .quad 0x3e4490f513ca5e3b
+ .quad 0x3e37a6af4d4c799d
+ .quad 0x3e57574c1c07398f
+ .quad 0x3e57b133417f8c1c
+ .quad 0x3e5feb9e0c176514
+ .quad 0x3e419f25bb3172f7
+ .quad 0x3e45f68a7bbfb852
+ .quad 0x3e5ee278497929f1
+ .quad 0x3e5ccee006109d58
+ .quad 0x3e5ce081a07bd8b3
+ .quad 0x3e570e12981817b8
+ .quad 0x3e292ab6d93503d0
+ .quad 0x3e58cb7dd7c3b61e
+ .quad 0x3e4efafd0a0b78da
+ .quad 0x3e5e907267c4288e
+ .quad 0x3e5d31ef96780875
+ .quad 0x3e23430dfcd2ad50
+ .quad 0x3e344d88d75bc1f9
+ .quad 0x3e5bec0f055e04fc
+ .quad 0x3e5d85611590b9ad
+ .quad 0x3df320568e583229
+ .quad 0x3e5a891d1772f538
+ .quad 0x3e22edc9dabba74d
+ .quad 0x3e4b9009a1015086
+ .quad 0x3e52a12a8c5b1a19
+ .quad 0x3e3a7885f0fdac85
+ .quad 0x3e5f4ffcd43ac691
+ .quad 0x3e52243ae2640aad
+ .quad 0x3e546513299035d3
+ .quad 0x3e5b39c3a62dd725
+ .quad 0x3e5ba6dd40049f51
+ .quad 0x3e451d1ed7177409
+ .quad 0x3e5cb0f2fd7f5216
+ .quad 0x3e3ab150cd4e2213
+ .quad 0x3e5cfd7bf3193844
+ .quad 0x3e53fff8455f1dbd
+ .quad 0x3e5fee640b905fc9
+ .quad 0x3e54e2adf548084c
+ .quad 0x3e3b597adc1ecdd2
+ .quad 0x3e4345bd096d3a75
+ .quad 0x3e5101b9d2453c8b
+ .quad 0x3e508ce55cc8c979
+ .quad 0x3e5bbf017e595f71
+ .quad 0x3e37ce733bd393dc
+ .quad 0x3e233bb0a503f8a1
+ .quad 0x3e30e2f613e85bd9
+ .quad 0x3e5e67555a635b3c
+ .quad 0x3e2ea88df73d5e8b
+ .quad 0x3e3d17e03bda18a8
+ .quad 0x3e5b607d76044f7e
+ .quad 0x3e52adc4e71bc2fc
+ .quad 0x3e5f99dc7362d1d9
+ .quad 0x3e5473fa008e6a6a
+ .quad 0x3e2b75bb09cb0985
+ .quad 0x3e5ea04dd10b9aba
+ .quad 0x3e5802d0d6979674
+ .quad 0x3e174688ccd99094
+ .quad 0x3e496f16abb9df22
+ .quad 0x3e46e66df2aa374f
+ .quad 0x3e4e66525ea4550a
+ .quad 0x3e42d02f34f20cbd
+ .quad 0x3e46cfce65047188
+ .quad 0x3e39b78c842d58b8
+ .quad 0x3e4735e624c24bc9
+ .quad 0x3e47eba1f7dd1adf
+ .quad 0x3e586b3e59f65355
+ .quad 0x3e1ce38e637f1b4d
+ .quad 0x3e58d82ec919edc7
+ .quad 0x3e4c52648ddcfa37
+ .quad 0x3e52482ceae1ac12
+ .quad 0x3e55a312311aba4f
+ .quad 0x3e411e236329f225
+ .quad 0x3e5b48c8cd2f246c
+ .quad 0x3e6efa39ef35793c
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_F_inv:
+ .quad 0x4000000000000000
+ .quad 0x3fffe01fe01fe020
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fffa11caa01fa12
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff6310aca0dbb5
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff25f644230ab5
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffee9c7f8458e02
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffeae807aba01eb
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe741aa59750e4
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe3a9179dc1a73
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffe01e01e01e01e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdca01dca01dca
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd92f2231e7f8a
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd5cac807572b2
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd272ca3fc5b1a
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcf26e5c44bfc6
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffcbe6d9601cbe7
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc8b265afb8a42
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc5894d10d4986
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc26b5392ea01c
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbf583ee868d8b
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbc4fd65883e7b
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb951e2b18ff23
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb65e2e3beee05
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb37484ad806ce
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffb094b31d922a4
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffadbe87f94905e
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffaaf1d2f87ebfd
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa82e65130e159
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa574107688a4a
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa2c2a87c51ca0
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ffa01a01a01a01a
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9d79f176b682d
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff9ae24ea5510da
+ .quad 0x3ff999999999999a
+ .quad 0x3ff9852f0d8ec0ff
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff95cbb0be377ae
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff934c67f9b2ce6
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff90d4f120190d5
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8e6527af1373f
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8bfce8062ff3a
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff899c0f601899c
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff87427bcc092b9
+ .quad 0x3ff8618618618618
+ .quad 0x3ff84f00c2780614
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff82a4a0182a4a0
+ .quad 0x3ff8181818181818
+ .quad 0x3ff8060180601806
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7e225515a4f1d
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7beb3922e017c
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff79baa6bb6398b
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff77908119ac60d
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff756cac201756d
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff734f0c541fe8d
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff713786d9c7c09
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6f26016f26017
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6d1a62681c861
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6b1490aa31a3d
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff691473a88d0c0
+ .quad 0x3ff6816816816817
+ .quad 0x3ff6719f3601671a
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff6524f853b4aa3
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff63356b88ac0de
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff614b36831ae94
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5f66434292dfc
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5d867c3ece2a5
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5babcc647fa91
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff59d61f123ccaa
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff5805601580560
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff56397ba7c52e2
+ .quad 0x3ff5555555555555
+ .quad 0x3ff54725e6bb82fe
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff52aff56a8054b
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff50f22e111c4c5
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4f38f62dd4c9b
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4d843bedc2c4c
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4bd3edda68fe1
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff4a27fad76014a
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff4880522014880
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff46dce34596066
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff453d9e2c776ca
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff43a2730abee4d
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff420b5265e5951
+ .quad 0x3ff4141414141414
+ .quad 0x3ff40782d10e6566
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3ee8f42a5af07
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3d5d991aa75c6
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3bd60d9232955
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3a524387ac822
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff38d22d366088e
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3755bd1c945ee
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff35dce5f9f2af8
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff34679ace01346
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff32f5ced6a1dfa
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff3187758e9ebb6
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff301c82ac40260
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2eb4ea1fed14b
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2d50a012d50a0
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2bef98e5a3711
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff2a91c92f3c105
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff293725bb804a5
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27dfa38a1ce4d
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff268b37cd60127
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2539d7e9177b2
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23eb79717605b
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff22a0122a0122a
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff21579804855e6
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff2012012012012
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1ecf43c7fb84c
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1d8f5672e4abd
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1c522fc1ce059
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1b17c67f2bae3
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19e0119e0119e
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff18ab083902bdb
+ .quad 0x3ff1811811811812
+ .quad 0x3ff1778a191bd684
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff1648d50fc3201
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff151b9a3fdd5c9
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff13f0e8d344724
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12c8b89edc0ac
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff11a3019a74826
+ .quad 0x3ff1111111111111
+ .quad 0x3ff107fbbe011080
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0f5edfab325a2
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0e40655826011
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0d24456359e3a
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0c0a7868b4171
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0af2f722eecb5
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff09ddba6af8360
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff08cabb37565e2
+ .quad 0x3ff0842108421084
+ .quad 0x3ff07b9f29b8eae2
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff06ab59c7912fb
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff059eea0727586
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff04949cc1664c5
+ .quad 0x3ff0410410410410
+ .quad 0x3ff038c6b78247fc
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff02864fc7729e9
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0182436517a37
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0080402010080
+ .quad 0x3ff0000000000000
+ .quad 0x0000000000000000
+
+#endif
diff --git a/src/gas/log10.S b/src/gas/log10.S
new file mode 100644
index 0000000..90522ef
--- /dev/null
+++ b/src/gas/log10.S
@@ -0,0 +1,1146 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log10.S
+#
+# An implementation of the log10 libm function.
+#
+# Prototype:
+#
+# double log10(double x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
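+# Outline (inferred from the code below):
+#   x is split as 2^m * Y; Y is reduced against the nearest F on a 1/256
+#   grid (index from the top 8 mantissa bits, rounded via the 9th bit),
+#   giving r = (F - Y)/F with 1/F taken from .L__log_F_inv.  Then
+#     log10(x) ~ m*log10(2) + log10(F) - (r + r^2/2 + ... + r^6/6)*log10(e)
+#   where log10(F) comes from .L__log_256_lead/_tail and log10(2) and
+#   log10(e) are kept as lead/tail pairs.  Arguments with |x - 1| < 0.0625
+#   take the near-one codepath below instead.
+#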
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log10)
+#define fname_special _log10_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %rax, %rax
+ movdqa %xmm0, %xmm3
+ movsd %xmm0, %xmm4
+ psrlq $52, %xmm3
+ movd %xmm0, %rax
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm5
+ andpd .L__real_inf(%rip), %xmm5
+ comisd .L__real_inf(%rip), %xmm5
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorpd %xmm5, %xmm5
+ comisd %xmm5, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm4
+
+ comisd .L__mask_1023_f(%rip), %xmm6
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %rax, %r9
+ and .L__mask_mant_all8(%rip), %rax
+ and .L__mask_mant9(%rip), %r9
+ shl $1, %r9
+ add %r9, %rax
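+ # rax now holds the top 8 mantissa bits rounded to nearest (carry from
+ # the 9th bit); the shr $44 below turns it into the table index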
+ mov %rax, p_temp(%rsp)
+
+ # near one codepath
+ andpd .L__real_notsign(%rip), %xmm4
+ comisd .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movsd p_temp(%rsp), %xmm1
+ shr $44, %rax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ mulsd (%r9,%rax,8), %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_6(%rip), %xmm3
+ movsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_5(%rip), %xmm3
+ addsd .L__real_1_over_2(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_over_4(%rip), %xmm3
+ addsd %xmm2, %xmm1
+ mulsd %xmm4, %xmm3
+ addsd %xmm3, %xmm1
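+ # xmm1 = r + r^2/2 + r^3/3 + r^4/4 + r^5/5 + r^6/6 ~ ln(F/Y)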
+
+ mulsd .L__real_log10_e(%rip), %xmm1
+
+ # m*log10(2) + log10(G) - poly
+ movsd .L__real_log10_2_tail(%rip), %xmm5
+ mulsd %xmm6, %xmm5
+ subsd %xmm1, %xmm5
+
+ movsd (%r9,%rax,8), %xmm0
+ lea .L__log_256_tail(%rip), %rdx
+ movsd (%rdx,%rax,8), %xmm2
+ addsd %xmm5, %xmm2
+
+ movsd .L__real_log10_2_lead(%rip), %xmm4
+ mulsd %xmm6, %xmm4
+ addsd %xmm4, %xmm0
+
+ addsd %xmm2, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
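+ # |x - 1| < 0.0625: ln(x) ~ u + ca1*u^3 + ca2*u^5 + ca3*u^7 + ca4*u^9
+ # (u = 2(x-1)/(x+1), a 2*atanh(u/2)-style series); the result is then
+ # scaled by log10(e) kept as a lead/tail pair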
+
+ # r = x - 1.0
+ movsd .L__real_two(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm0 # r
+
+ addsd %xmm0, %xmm2
+ movsd %xmm0, %xmm1
+ divsd %xmm2, %xmm1 # r/(2+r) = u/2
+
+ movsd .L__real_ca2(%rip), %xmm4
+ movsd .L__real_ca4(%rip), %xmm5
+
+ movsd %xmm0, %xmm6
+ mulsd %xmm1, %xmm6 # correction
+
+ addsd %xmm1, %xmm1 # u
+ movsd %xmm1, %xmm2
+
+ mulsd %xmm1, %xmm2 # u^2
+
+ mulsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm5
+
+ addsd .L__real_ca1(%rip), %xmm4
+ addsd .L__real_ca3(%rip), %xmm5
+
+ mulsd %xmm1, %xmm2 # u^3
+ mulsd %xmm2, %xmm4
+
+ mulsd %xmm2, %xmm2
+ mulsd %xmm1, %xmm2 # u^7
+ mulsd %xmm2, %xmm5
+
+ addsd %xmm5, %xmm4
+ subsd %xmm6, %xmm4
+
+ movdqa %xmm0, %xmm3
+ pand .L__mask_lower(%rip), %xmm3
+ subsd %xmm3, %xmm0
+ addsd %xmm0, %xmm4
+
+ movsd %xmm3, %xmm0
+ movsd %xmm4, %xmm1
+
+ mulsd .L__real_log10_e_tail(%rip), %xmm4
+ mulsd .L__real_log10_e_tail(%rip), %xmm0
+ mulsd .L__real_log10_e_lead(%rip), %xmm1
+ mulsd .L__real_log10_e_lead(%rip), %xmm3
+
+ addsd %xmm4, %xmm0
+ addsd %xmm1, %xmm0
+ addsd %xmm3, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.L__denormal_adjust:
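+ # denormal input: OR in the implicit bit, subtract 1.0 to renormalize,
+ # then recover the mantissa bits and the true unbiased exponent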
+ por .L__real_one(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm2
+ movsd %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %rax
+ psrlq $52, %xmm5
+ psubd .L__mask_2045(%rip), %xmm5
+ cvtdq2pd %xmm5, %xmm6
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movsd .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movsd .L__real_qnan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %rax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %rax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9
+ and %rax, %r9
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %rax
+ movd %rax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_qnan: .quad 0x7ff8000000000000 # qNaN
+ .quad 0x0000000000000000
+.L__real_qnanbit: .quad 0x0008000000000000
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+.L__mask_001: .quad 0x0000000000000001
+ .quad 0x0000000000000000
+
+.L__mask_mant_all8: .quad 0x000ff00000000000
+ .quad 0x0000000000000000
+.L__mask_mant9: .quad 0x0000080000000000
+ .quad 0x0000000000000000
+
+.L__real_log10_e: .quad 0x3fdbcb7b1526e50e
+ .quad 0x0000000000000000
+
+.L__real_log10_e_lead: .quad 0x3fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01
+ .quad 0x0000000000000000
+.L__real_log10_e_tail: .quad 0x3ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7
+ .quad 0x0000000000000000
+
+.L__real_log10_2_lead: .quad 0x3fd3441350000000
+ .quad 0x0000000000000000
+.L__real_log10_2_tail: .quad 0x3e03ef3fde623e25
+ .quad 0x0000000000000000
+
+
+
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__mask_100: .quad 0x0000000000000100
+ .quad 0x0000000000000000
+
+.L__real_1_over_512: .quad 0x3f60000000000000
+ .quad 0x0000000000000000
+
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_5: .quad 0x3fc999999999999a
+ .quad 0x0000000000000000
+.L__real_1_over_6: .quad 0x3fc5555555555555
+ .quad 0x0000000000000000
+
+.L__mask_1023_f: .quad 0x0c08ff80000000000
+ .quad 0x0000000000000000
+
+.L__mask_2045: .quad 0x00000000000007fd
+ .quad 0x0000000000000000
+
+.L__real_threshold: .quad 0x3fb0000000000000 # .0625
+ .quad 0x0000000000000000
+
+.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x0000000000000000
+
+.L__real_ca1: .quad 0x3fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x0000000000000000
+.L__real_ca2: .quad 0x3f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x0000000000000000
+.L__real_ca3: .quad 0x3f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x0000000000000000
+.L__real_ca4: .quad 0x3f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x0000000000000000
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_lead:
+ .quad 0x0000000000000000
+ .quad 0x3f5bbd9e90000000
+ .quad 0x3f6bafd470000000
+ .quad 0x3f74b99560000000
+ .quad 0x3f7b9476a0000000
+ .quad 0x3f81344da0000000
+ .quad 0x3f849b0850000000
+ .quad 0x3f87fe71c0000000
+ .quad 0x3f8b5e9080000000
+ .quad 0x3f8ebb6af0000000
+ .quad 0x3f910a83a0000000
+ .quad 0x3f92b5b5e0000000
+ .quad 0x3f945f4f50000000
+ .quad 0x3f96075300000000
+ .quad 0x3f97adc3d0000000
+ .quad 0x3f9952a4f0000000
+ .quad 0x3f9af5f920000000
+ .quad 0x3f9c97c370000000
+ .quad 0x3f9e3806a0000000
+ .quad 0x3f9fd6c5b0000000
+ .quad 0x3fa0ba01a0000000
+ .quad 0x3fa187e120000000
+ .quad 0x3fa25502c0000000
+ .quad 0x3fa32167c0000000
+ .quad 0x3fa3ed1190000000
+ .quad 0x3fa4b80180000000
+ .quad 0x3fa58238e0000000
+ .quad 0x3fa64bb910000000
+ .quad 0x3fa7148340000000
+ .quad 0x3fa7dc98c0000000
+ .quad 0x3fa8a3fad0000000
+ .quad 0x3fa96aaac0000000
+ .quad 0x3faa30a9d0000000
+ .quad 0x3faaf5f920000000
+ .quad 0x3fabba9a00000000
+ .quad 0x3fac7e8d90000000
+ .quad 0x3fad41d510000000
+ .quad 0x3fae0471a0000000
+ .quad 0x3faec66470000000
+ .quad 0x3faf87aeb0000000
+ .quad 0x3fb02428c0000000
+ .quad 0x3fb08426f0000000
+ .quad 0x3fb0e3d290000000
+ .quad 0x3fb1432c30000000
+ .quad 0x3fb1a23440000000
+ .quad 0x3fb200eb60000000
+ .quad 0x3fb25f5210000000
+ .quad 0x3fb2bd68e0000000
+ .quad 0x3fb31b3050000000
+ .quad 0x3fb378a8e0000000
+ .quad 0x3fb3d5d330000000
+ .quad 0x3fb432afa0000000
+ .quad 0x3fb48f3ed0000000
+ .quad 0x3fb4eb8120000000
+ .quad 0x3fb5477730000000
+ .quad 0x3fb5a32160000000
+ .quad 0x3fb5fe8040000000
+ .quad 0x3fb6599440000000
+ .quad 0x3fb6b45df0000000
+ .quad 0x3fb70eddb0000000
+ .quad 0x3fb7691400000000
+ .quad 0x3fb7c30160000000
+ .quad 0x3fb81ca630000000
+ .quad 0x3fb8760300000000
+ .quad 0x3fb8cf1830000000
+ .quad 0x3fb927e640000000
+ .quad 0x3fb9806d90000000
+ .quad 0x3fb9d8aea0000000
+ .quad 0x3fba30a9d0000000
+ .quad 0x3fba885fa0000000
+ .quad 0x3fbadfd070000000
+ .quad 0x3fbb36fcb0000000
+ .quad 0x3fbb8de4d0000000
+ .quad 0x3fbbe48930000000
+ .quad 0x3fbc3aea40000000
+ .quad 0x3fbc910870000000
+ .quad 0x3fbce6e410000000
+ .quad 0x3fbd3c7da0000000
+ .quad 0x3fbd91d580000000
+ .quad 0x3fbde6ec00000000
+ .quad 0x3fbe3bc1a0000000
+ .quad 0x3fbe9056b0000000
+ .quad 0x3fbee4aba0000000
+ .quad 0x3fbf38c0c0000000
+ .quad 0x3fbf8c9680000000
+ .quad 0x3fbfe02d30000000
+ .quad 0x3fc019c2a0000000
+ .quad 0x3fc0434f70000000
+ .quad 0x3fc06cbd60000000
+ .quad 0x3fc0960c80000000
+ .quad 0x3fc0bf3d00000000
+ .quad 0x3fc0e84f10000000
+ .quad 0x3fc11142f0000000
+ .quad 0x3fc13a18a0000000
+ .quad 0x3fc162d080000000
+ .quad 0x3fc18b6a90000000
+ .quad 0x3fc1b3e710000000
+ .quad 0x3fc1dc4630000000
+ .quad 0x3fc2048810000000
+ .quad 0x3fc22cace0000000
+ .quad 0x3fc254b4d0000000
+ .quad 0x3fc27c9ff0000000
+ .quad 0x3fc2a46e80000000
+ .quad 0x3fc2cc20b0000000
+ .quad 0x3fc2f3b690000000
+ .quad 0x3fc31b3050000000
+ .quad 0x3fc3428e20000000
+ .quad 0x3fc369d020000000
+ .quad 0x3fc390f680000000
+ .quad 0x3fc3b80160000000
+ .quad 0x3fc3def0e0000000
+ .quad 0x3fc405c530000000
+ .quad 0x3fc42c7e70000000
+ .quad 0x3fc4531cd0000000
+ .quad 0x3fc479a070000000
+ .quad 0x3fc4a00970000000
+ .quad 0x3fc4c65800000000
+ .quad 0x3fc4ec8c30000000
+ .quad 0x3fc512a640000000
+ .quad 0x3fc538a630000000
+ .quad 0x3fc55e8c50000000
+ .quad 0x3fc5845890000000
+ .quad 0x3fc5aa0b40000000
+ .quad 0x3fc5cfa470000000
+ .quad 0x3fc5f52440000000
+ .quad 0x3fc61a8ad0000000
+ .quad 0x3fc63fd850000000
+ .quad 0x3fc6650cd0000000
+ .quad 0x3fc68a2880000000
+ .quad 0x3fc6af2b80000000
+ .quad 0x3fc6d415e0000000
+ .quad 0x3fc6f8e7d0000000
+ .quad 0x3fc71da170000000
+ .quad 0x3fc74242e0000000
+ .quad 0x3fc766cc40000000
+ .quad 0x3fc78b3da0000000
+ .quad 0x3fc7af9730000000
+ .quad 0x3fc7d3d910000000
+ .quad 0x3fc7f80350000000
+ .quad 0x3fc81c1620000000
+ .quad 0x3fc8401190000000
+ .quad 0x3fc863f5c0000000
+ .quad 0x3fc887c2e0000000
+ .quad 0x3fc8ab7900000000
+ .quad 0x3fc8cf1830000000
+ .quad 0x3fc8f2a0a0000000
+ .quad 0x3fc9161270000000
+ .quad 0x3fc9396db0000000
+ .quad 0x3fc95cb280000000
+ .quad 0x3fc97fe100000000
+ .quad 0x3fc9a2f950000000
+ .quad 0x3fc9c5fb70000000
+ .quad 0x3fc9e8e7b0000000
+ .quad 0x3fca0bbdf0000000
+ .quad 0x3fca2e7e80000000
+ .quad 0x3fca512960000000
+ .quad 0x3fca73bea0000000
+ .quad 0x3fca963e70000000
+ .quad 0x3fcab8a8f0000000
+ .quad 0x3fcadafe20000000
+ .quad 0x3fcafd3e30000000
+ .quad 0x3fcb1f6930000000
+ .quad 0x3fcb417f40000000
+ .quad 0x3fcb638070000000
+ .quad 0x3fcb856cf0000000
+ .quad 0x3fcba744b0000000
+ .quad 0x3fcbc907f0000000
+ .quad 0x3fcbeab6c0000000
+ .quad 0x3fcc0c5130000000
+ .quad 0x3fcc2dd750000000
+ .quad 0x3fcc4f4950000000
+ .quad 0x3fcc70a740000000
+ .quad 0x3fcc91f130000000
+ .quad 0x3fccb32740000000
+ .quad 0x3fccd44980000000
+ .quad 0x3fccf55810000000
+ .quad 0x3fcd165300000000
+ .quad 0x3fcd373a60000000
+ .quad 0x3fcd580e60000000
+ .quad 0x3fcd78cf00000000
+ .quad 0x3fcd997c70000000
+ .quad 0x3fcdba16a0000000
+ .quad 0x3fcdda9dd0000000
+ .quad 0x3fcdfb11f0000000
+ .quad 0x3fce1b7330000000
+ .quad 0x3fce3bc1a0000000
+ .quad 0x3fce5bfd50000000
+ .quad 0x3fce7c2660000000
+ .quad 0x3fce9c3ce0000000
+ .quad 0x3fcebc40e0000000
+ .quad 0x3fcedc3280000000
+ .quad 0x3fcefc11d0000000
+ .quad 0x3fcf1bdee0000000
+ .quad 0x3fcf3b99d0000000
+ .quad 0x3fcf5b42a0000000
+ .quad 0x3fcf7ad980000000
+ .quad 0x3fcf9a5e70000000
+ .quad 0x3fcfb9d190000000
+ .quad 0x3fcfd932f0000000
+ .quad 0x3fcff882a0000000
+ .quad 0x3fd00be050000000
+ .quad 0x3fd01b76a0000000
+ .quad 0x3fd02b0430000000
+ .quad 0x3fd03a8910000000
+ .quad 0x3fd04a0540000000
+ .quad 0x3fd05978e0000000
+ .quad 0x3fd068e3f0000000
+ .quad 0x3fd0784670000000
+ .quad 0x3fd087a080000000
+ .quad 0x3fd096f210000000
+ .quad 0x3fd0a63b30000000
+ .quad 0x3fd0b57bf0000000
+ .quad 0x3fd0c4b450000000
+ .quad 0x3fd0d3e460000000
+ .quad 0x3fd0e30c30000000
+ .quad 0x3fd0f22bc0000000
+ .quad 0x3fd1014310000000
+ .quad 0x3fd1105240000000
+ .quad 0x3fd11f5940000000
+ .quad 0x3fd12e5830000000
+ .quad 0x3fd13d4f00000000
+ .quad 0x3fd14c3dd0000000
+ .quad 0x3fd15b24a0000000
+ .quad 0x3fd16a0370000000
+ .quad 0x3fd178da50000000
+ .quad 0x3fd187a940000000
+ .quad 0x3fd1967060000000
+ .quad 0x3fd1a52fa0000000
+ .quad 0x3fd1b3e710000000
+ .quad 0x3fd1c296c0000000
+ .quad 0x3fd1d13eb0000000
+ .quad 0x3fd1dfdef0000000
+ .quad 0x3fd1ee7770000000
+ .quad 0x3fd1fd0860000000
+ .quad 0x3fd20b91a0000000
+ .quad 0x3fd21a1350000000
+ .quad 0x3fd2288d70000000
+ .quad 0x3fd2370010000000
+ .quad 0x3fd2456b30000000
+ .quad 0x3fd253ced0000000
+ .quad 0x3fd2622b00000000
+ .quad 0x3fd2707fd0000000
+ .quad 0x3fd27ecd40000000
+ .quad 0x3fd28d1360000000
+ .quad 0x3fd29b5220000000
+ .quad 0x3fd2a989a0000000
+ .quad 0x3fd2b7b9e0000000
+ .quad 0x3fd2c5e2e0000000
+ .quad 0x3fd2d404b0000000
+ .quad 0x3fd2e21f50000000
+ .quad 0x3fd2f032c0000000
+ .quad 0x3fd2fe3f20000000
+ .quad 0x3fd30c4470000000
+ .quad 0x3fd31a42b0000000
+ .quad 0x3fd32839e0000000
+ .quad 0x3fd3362a10000000
+ .quad 0x3fd3441350000000
+
+.align 16
+.L__log_256_tail:
+ .quad 0x0000000000000000
+ .quad 0x3db20abc22b2208f
+ .quad 0x3db10f69332e0dd4
+ .quad 0x3dce950de87ed257
+ .quad 0x3dd3f3443b626d69
+ .quad 0x3df45aeaa5363e57
+ .quad 0x3dc443683ce1bf0b
+ .quad 0x3df989cd60c6a511
+ .quad 0x3dfd626f201f2e9f
+ .quad 0x3de94f8bb8dabdcd
+ .quad 0x3e0088d8ef423015
+ .quad 0x3e080413a62b79ad
+ .quad 0x3e059717c0eed3c4
+ .quad 0x3dad4a77add44902
+ .quad 0x3e0e763ff037300e
+ .quad 0x3de162d74706f6c3
+ .quad 0x3e0601cc1f4dbc14
+ .quad 0x3deaf3e051f6e5bf
+ .quad 0x3e097a0b1e1af3eb
+ .quad 0x3dc0a38970c002c7
+ .quad 0x3e102e000057c751
+ .quad 0x3e155b00eecd6e0e
+ .quad 0x3ddf86297003b5af
+ .quad 0x3e1057b9b336a36d
+ .quad 0x3e134bc84a06ea4f
+ .quad 0x3e1643da9ea1bcad
+ .quad 0x3e1d66a7b4f7ea2a
+ .quad 0x3df6b2e038f7fcef
+ .quad 0x3df3e954c670f088
+ .quad 0x3e047209093acab3
+ .quad 0x3e1d708fe7275da7
+ .quad 0x3e1fdf9e7771b9e7
+ .quad 0x3e0827bfa70a0660
+ .quad 0x3e1601cc1f4dbc14
+ .quad 0x3e0637f6106a5e5b
+ .quad 0x3e126a13f17c624b
+ .quad 0x3e093eb2ce80623a
+ .quad 0x3e1430d1e91594de
+ .quad 0x3e1d6b10108fa031
+ .quad 0x3e16879c0bbaf241
+ .quad 0x3dff08015ea6bc2b
+ .quad 0x3e29b63dcdc6676c
+ .quad 0x3e2b022cbcc4ab2c
+ .quad 0x3df917d07ddd6544
+ .quad 0x3e1540605703379e
+ .quad 0x3e0cd18b947a1b60
+ .quad 0x3e17ad65277ca97e
+ .quad 0x3e11884dc59f5fa9
+ .quad 0x3e1711c46006d082
+ .quad 0x3e2f092e3c3108f8
+ .quad 0x3e1714c5e32be13a
+ .quad 0x3e26bba7fd734f9a
+ .quad 0x3dfdf48fb5e08483
+ .quad 0x3e232f9bc74d0b95
+ .quad 0x3df973e848790c13
+ .quad 0x3e1eccbc08c6586e
+ .quad 0x3e2115e9f9524a98
+ .quad 0x3e2f1740593131b8
+ .quad 0x3e1bcf8b25643835
+ .quad 0x3e1f5fa81d8bed80
+ .quad 0x3e244a4df929d9e4
+ .quad 0x3e129820d8220c94
+ .quad 0x3e2a0b489304e309
+ .quad 0x3e1f4d56aba665fe
+ .quad 0x3e210c9019365163
+ .quad 0x3df80f78fe592736
+ .quad 0x3e10528825c81cca
+ .quad 0x3de095537d6d746a
+ .quad 0x3e1827bfa70a0660
+ .quad 0x3e06b0a8ec45933c
+ .quad 0x3e105af81bf5dba9
+ .quad 0x3e17e2fa2655d515
+ .quad 0x3e0d59ecbfaee4bf
+ .quad 0x3e1d8b2fda683fa3
+ .quad 0x3e24b8ddfd3a3737
+ .quad 0x3e13827e61ae1204
+ .quad 0x3e2c8c7b49e90f9f
+ .quad 0x3e29eaf01597591d
+ .quad 0x3e19aaa66e317b36
+ .quad 0x3e2e725609720655
+ .quad 0x3e261c33fc7aac54
+ .quad 0x3e29662bcf61a252
+ .quad 0x3e1843c811c42730
+ .quad 0x3e2064bb0b5acb36
+ .quad 0x3e0a340c842701a4
+ .quad 0x3e1a8e55b58f79d6
+ .quad 0x3de92d219c5e9d9a
+ .quad 0x3e3f63e60d7ffd6a
+ .quad 0x3e2e9b0ed9516314
+ .quad 0x3e2923901962350c
+ .quad 0x3e326f8838785e81
+ .quad 0x3e3b5b6a4caba6af
+ .quad 0x3df0226adc8e761c
+ .quad 0x3e3c4ad7313a1aed
+ .quad 0x3e1564e87c738d17
+ .quad 0x3e338fecf18a6618
+ .quad 0x3e3d929ef5777666
+ .quad 0x3e39483bf08da0b8
+ .quad 0x3e3bdd0eeeaa5826
+ .quad 0x3e39c4dd590237ba
+ .quad 0x3e1af3e9e0ebcac7
+ .quad 0x3e35ce5382270dac
+ .quad 0x3e394f74532ab9ba
+ .quad 0x3e07342795888654
+ .quad 0x3e0c5a000be34bf0
+ .quad 0x3e2711c46006d082
+ .quad 0x3e250025b4ed8cf8
+ .quad 0x3e2ed18bcef2d2a0
+ .quad 0x3e21282e0c0a7554
+ .quad 0x3e0d70f33359a7ca
+ .quad 0x3e2b7f7e13a84025
+ .quad 0x3e33306ec321891e
+ .quad 0x3e3fc7f8038b7550
+ .quad 0x3e3eb0358cd71d64
+ .quad 0x3e3a76c822859474
+ .quad 0x3e3d0ec652de86e3
+ .quad 0x3e2fa4cce08658af
+ .quad 0x3e3b84a2d2c00a9e
+ .quad 0x3e20a5b0f2c25bd1
+ .quad 0x3e3dd660225bf699
+ .quad 0x3e08b10f859bf037
+ .quad 0x3e3e8823b590cbe1
+ .quad 0x3e361311f31e96f6
+ .quad 0x3e2e1f875ca20f9a
+ .quad 0x3e2c95724939b9a5
+ .quad 0x3e3805957a3e58e2
+ .quad 0x3e2ff126ea9f0334
+ .quad 0x3e3953f5598e5609
+ .quad 0x3e36c16ff856c448
+ .quad 0x3e24cb220ff261f4
+ .quad 0x3e35e120d53d53a2
+ .quad 0x3e3a527f6189f256
+ .quad 0x3e3856fcffd49c0f
+ .quad 0x3e300c2e8228d7da
+ .quad 0x3df113d09444dfe0
+ .quad 0x3e2510630eea59a6
+ .quad 0x3e262e780f32d711
+ .quad 0x3ded3ed91a10f8cf
+ .quad 0x3e23654a7e4bcd85
+ .quad 0x3e055b784980ad21
+ .quad 0x3e212f2dd4b16e64
+ .quad 0x3e37c4add939f50c
+ .quad 0x3e281784627180fc
+ .quad 0x3dea5162c7e14961
+ .quad 0x3e310c9019365163
+ .quad 0x3e373c4d2ba17688
+ .quad 0x3e2ae8a5e0e93d81
+ .quad 0x3e2ab0c6f01621af
+ .quad 0x3e301e8b74dd5b66
+ .quad 0x3e2d206fecbb5494
+ .quad 0x3df0b48b724fcc00
+ .quad 0x3e3f831f0b61e229
+ .quad 0x3df81a97c407bcaf
+ .quad 0x3e3e286c1ccbb7aa
+ .quad 0x3e28630b49220a93
+ .quad 0x3dff0b15c1a22c5c
+ .quad 0x3e355445e71c0946
+ .quad 0x3e3be630f8066d85
+ .quad 0x3e2599dff0d96c39
+ .quad 0x3e36cc85b18fb081
+ .quad 0x3e34476d001ea8c8
+ .quad 0x3e373f889e16d31f
+ .quad 0x3e3357100d792a87
+ .quad 0x3e3bd179ae6101f6
+ .quad 0x3e0ca31056c3f6e2
+ .quad 0x3e3d2870629c08fb
+ .quad 0x3e3aba3880d2673f
+ .quad 0x3e2c3633cb297da6
+ .quad 0x3e21843899efea02
+ .quad 0x3e3bccc99d2008e6
+ .quad 0x3e38000544bdd350
+ .quad 0x3e2b91c226606ae1
+ .quad 0x3e2a7adf26b62bdf
+ .quad 0x3e18764fc8826ec9
+ .quad 0x3e1f4f3de50f68f0
+ .quad 0x3df760ca757995e3
+ .quad 0x3dfc667ed3805147
+ .quad 0x3e3733f6196adf6f
+ .quad 0x3e2fb710f33e836b
+ .quad 0x3e39886eba641013
+ .quad 0x3dfb5368d0af8c1a
+ .quad 0x3e358c691b8d2971
+ .quad 0x3dfe9465226d08fb
+ .quad 0x3e33587e063f0097
+ .quad 0x3e3618e702129f18
+ .quad 0x3e361c33fc7aac54
+ .quad 0x3e3f07a68408604a
+ .quad 0x3e3c34bfe4945421
+ .quad 0x3e38b1f00e41300b
+ .quad 0x3e3f434284d61b63
+ .quad 0x3e3a63095e397436
+ .quad 0x3e34428656b919de
+ .quad 0x3e36ca9201b2d9a6
+ .quad 0x3e2738823a2a931c
+ .quad 0x3e3c11880e179230
+ .quad 0x3e313ddc8d6d52fe
+ .quad 0x3e33eed58922e917
+ .quad 0x3e295992846bdd50
+ .quad 0x3e0ddb4d5f2e278b
+ .quad 0x3df1a5f12a0635c4
+ .quad 0x3e4642f0882c3c34
+ .quad 0x3e2aee9ba7f6475e
+ .quad 0x3e264b7f834a60e4
+ .quad 0x3e290d42e243792e
+ .quad 0x3e4c272008134f01
+ .quad 0x3e4a782e16d6cf5b
+ .quad 0x3e44505c79da6648
+ .quad 0x3e4ca9d4ea4dcd21
+ .quad 0x3e297d3d627cd5bc
+ .quad 0x3e20b15cf9bcaa13
+ .quad 0x3e315b2063cf76dd
+ .quad 0x3e2983e6f3aa2748
+ .quad 0x3e3f4c64f4ffe994
+ .quad 0x3e46beba7ce85a0f
+ .quad 0x3e3b9c69fd4ea6b8
+ .quad 0x3e2b6aa5835fa4ab
+ .quad 0x3e43ccc3790fedd1
+ .quad 0x3e29c04cc4404fe0
+ .quad 0x3e40734b7a75d89d
+ .quad 0x3e1b4404c4e01612
+ .quad 0x3e40c565c2ce4894
+ .quad 0x3e33c71441d935cd
+ .quad 0x3d72a492556b3b4e
+ .quad 0x3e20fa090341dc43
+ .quad 0x3e2e8f7009e3d9f4
+ .quad 0x3e4b1bf68b048a45
+ .quad 0x3e3eee52dffaa956
+ .quad 0x3e456b0900e465bd
+ .quad 0x3e4d929ef5777666
+ .quad 0x3e486ea28637e260
+ .quad 0x3e4665aff10ca2f0
+ .quad 0x3e2f11fdaf48ec74
+ .quad 0x3e4cbe1b86a4d1c7
+ .quad 0x3e25b05bfea87665
+ .quad 0x3e41cec20a1a4a1d
+ .quad 0x3e41cd5f0a409b9f
+ .quad 0x3e453656c8265070
+ .quad 0x3e377ed835282260
+ .quad 0x3e2417bc3040b9d2
+ .quad 0x3e408eef7b79eff2
+ .quad 0x3e4dc76f39dc57e9
+ .quad 0x3e4c0493a70cf457
+ .quad 0x3e4a83d6cea5a60c
+ .quad 0x3e30d6700dc557ba
+ .quad 0x3e44c96c12e8bd0a
+ .quad 0x3e3d2c1993e32315
+ .quad 0x3e22c721135f8242
+ .quad 0x3e279a3e4dda747d
+ .quad 0x3dfcf89f6941a72b
+ .quad 0x3e2149a702f10831
+ .quad 0x3e4ead4b7c8175db
+ .quad 0x3e4e6930fe63e70a
+ .quad 0x3e41e106bed9ee2f
+ .quad 0x3e2d682b82f11c92
+ .quad 0x3e3a07f188dba47c
+ .quad 0x3e40f9342dc172f6
+ .quad 0x3e03ef3fde623e25
+
+.align 16
+.L__log_F_inv:
+ .quad 0x4000000000000000
+ .quad 0x3fffe01fe01fe020
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fffa11caa01fa12
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff6310aca0dbb5
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff25f644230ab5
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffee9c7f8458e02
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffeae807aba01eb
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe741aa59750e4
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe3a9179dc1a73
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffe01e01e01e01e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdca01dca01dca
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd92f2231e7f8a
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd5cac807572b2
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd272ca3fc5b1a
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcf26e5c44bfc6
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffcbe6d9601cbe7
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc8b265afb8a42
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc5894d10d4986
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc26b5392ea01c
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbf583ee868d8b
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbc4fd65883e7b
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb951e2b18ff23
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb65e2e3beee05
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb37484ad806ce
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffb094b31d922a4
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffadbe87f94905e
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffaaf1d2f87ebfd
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa82e65130e159
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa574107688a4a
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa2c2a87c51ca0
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ffa01a01a01a01a
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9d79f176b682d
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff9ae24ea5510da
+ .quad 0x3ff999999999999a
+ .quad 0x3ff9852f0d8ec0ff
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff95cbb0be377ae
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff934c67f9b2ce6
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff90d4f120190d5
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8e6527af1373f
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8bfce8062ff3a
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff899c0f601899c
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff87427bcc092b9
+ .quad 0x3ff8618618618618
+ .quad 0x3ff84f00c2780614
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff82a4a0182a4a0
+ .quad 0x3ff8181818181818
+ .quad 0x3ff8060180601806
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7e225515a4f1d
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7beb3922e017c
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff79baa6bb6398b
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff77908119ac60d
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff756cac201756d
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff734f0c541fe8d
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff713786d9c7c09
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6f26016f26017
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6d1a62681c861
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6b1490aa31a3d
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff691473a88d0c0
+ .quad 0x3ff6816816816817
+ .quad 0x3ff6719f3601671a
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff6524f853b4aa3
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff63356b88ac0de
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff614b36831ae94
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5f66434292dfc
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5d867c3ece2a5
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5babcc647fa91
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff59d61f123ccaa
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff5805601580560
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff56397ba7c52e2
+ .quad 0x3ff5555555555555
+ .quad 0x3ff54725e6bb82fe
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff52aff56a8054b
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff50f22e111c4c5
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4f38f62dd4c9b
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4d843bedc2c4c
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4bd3edda68fe1
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff4a27fad76014a
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff4880522014880
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff46dce34596066
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff453d9e2c776ca
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff43a2730abee4d
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff420b5265e5951
+ .quad 0x3ff4141414141414
+ .quad 0x3ff40782d10e6566
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3ee8f42a5af07
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3d5d991aa75c6
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3bd60d9232955
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3a524387ac822
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff38d22d366088e
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3755bd1c945ee
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff35dce5f9f2af8
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff34679ace01346
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff32f5ced6a1dfa
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff3187758e9ebb6
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff301c82ac40260
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2eb4ea1fed14b
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2d50a012d50a0
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2bef98e5a3711
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff2a91c92f3c105
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff293725bb804a5
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27dfa38a1ce4d
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff268b37cd60127
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2539d7e9177b2
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23eb79717605b
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff22a0122a0122a
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff21579804855e6
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff2012012012012
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1ecf43c7fb84c
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1d8f5672e4abd
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1c522fc1ce059
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1b17c67f2bae3
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19e0119e0119e
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff18ab083902bdb
+ .quad 0x3ff1811811811812
+ .quad 0x3ff1778a191bd684
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff1648d50fc3201
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff151b9a3fdd5c9
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff13f0e8d344724
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12c8b89edc0ac
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff11a3019a74826
+ .quad 0x3ff1111111111111
+ .quad 0x3ff107fbbe011080
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0f5edfab325a2
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0e40655826011
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0d24456359e3a
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0c0a7868b4171
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0af2f722eecb5
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff09ddba6af8360
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff08cabb37565e2
+ .quad 0x3ff0842108421084
+ .quad 0x3ff07b9f29b8eae2
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff06ab59c7912fb
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff059eea0727586
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff04949cc1664c5
+ .quad 0x3ff0410410410410
+ .quad 0x3ff038c6b78247fc
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff02864fc7729e9
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0182436517a37
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0080402010080
+ .quad 0x3ff0000000000000
+ .quad 0x0000000000000000
+
+
diff --git a/src/gas/log10f.S b/src/gas/log10f.S
new file mode 100644
index 0000000..eb89c6c
--- /dev/null
+++ b/src/gas/log10f.S
@@ -0,0 +1,745 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log10f.S
+#
+# An implementation of the log10f libm function.
+#
+# Prototype:
+#
+# float log10f(float x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
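+# Single-precision variant: the reduction mirrors log10.S but indexes the
+# tables on a 1/128 grid (top 7 mantissa bits, rounded via the 8th bit) and
+# uses the shorter polynomial r + r^2/2 + r^3/3 before the log10(e) scaling.
+#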
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log10f)
+#define fname_special _log10f_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %eax, %eax
+ movdqa %xmm0, %xmm3
+ movss %xmm0, %xmm4
+ psrld $23, %xmm3
+ movd %xmm0, %eax
+ psubd .L__mask_127(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2ps %xmm3, %xmm5 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm1
+ andps .L__real_inf(%rip), %xmm1
+ comiss .L__real_inf(%rip), %xmm1
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorps %xmm1, %xmm1
+ comiss %xmm1, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm4
+
+ comiss .L__real_neg127(%rip), %xmm5
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %eax, %r9d
+ and .L__mask_mant_all7(%rip), %eax
+ and .L__mask_mant8(%rip), %r9d
+ shl $1, %r9d
+ add %r9d, %eax
+ mov %eax, p_temp(%rsp)
+
+ # near one codepath
+ andps .L__real_notsign(%rip), %xmm4
+ comiss .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movss p_temp(%rsp), %xmm1
+ shr $16, %eax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subss %xmm2, %xmm1
+ mulss (%r9,%rax,4), %xmm1
+
+ movss %xmm1, %xmm2
+ movss %xmm1, %xmm0
+
+ # poly
+ mulss .L__real_1_over_3(%rip), %xmm2
+ mulss %xmm1, %xmm0
+ addss .L__real_1_over_2(%rip), %xmm2
+ movss .L__real_log10_2_tail(%rip), %xmm3
+
+ lea .L__log_128_tail(%rip), %r9
+ lea .L__log_128_lead(%rip), %r10
+
+ mulss %xmm0, %xmm2
+ mulss %xmm5, %xmm3
+ addss %xmm2, %xmm1
+
+ mulss .L__real_log10_e(%rip), %xmm1
+
+ # m*log10(2) + log10(G) - poly
+ movss .L__real_log10_2_lead(%rip), %xmm0
+ subss %xmm1, %xmm3 # z2
+ mulss %xmm5, %xmm0
+ addss (%r9,%rax,4), %xmm3
+ addss (%r10,%rax,4), %xmm0
+
+ addss %xmm3, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
+ # r = x - 1.0
+ movss .L__real_two(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm0
+
+ # u = r / (2.0 + r)
+ addss %xmm0, %xmm2
+ movss %xmm0, %xmm1
+ divss %xmm2, %xmm1 # u
+
+ # correction = r * u
+ movss %xmm0, %xmm4
+ mulss %xmm1, %xmm4
+
+ # u = u + u
+ addss %xmm1, %xmm1
+ movss %xmm1, %xmm2
+ mulss %xmm2, %xmm2 # v = u^2
+
+ # r2 = (u * v * (ca_1 + v * ca_2) - correction)
+ movss %xmm1, %xmm3
+ mulss %xmm2, %xmm3 # u^3
+ mulss .L__real_ca2(%rip), %xmm2 # Bu^2
+ addss .L__real_ca1(%rip), %xmm2 # +A
+ mulss %xmm3, %xmm2
+ subss %xmm4, %xmm2 # -correction
+
+ movdqa %xmm0, %xmm5
+ pand .L__mask_lower(%rip), %xmm5
+ subss %xmm5, %xmm0
+ addss %xmm0, %xmm2
+
+ movss %xmm5, %xmm0
+ movss %xmm2, %xmm1
+
+ mulss .L__real_log10_e_tail(%rip), %xmm2
+ mulss .L__real_log10_e_tail(%rip), %xmm0
+ mulss .L__real_log10_e_lead(%rip), %xmm1
+ mulss .L__real_log10_e_lead(%rip), %xmm5
+
+ addss %xmm2, %xmm0
+ addss %xmm1, %xmm0
+ addss %xmm5, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm2
+ movdqa %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %eax
+ psrld $23, %xmm5
+ psubd .L__mask_253(%rip), %xmm5
+ cvtdq2ps %xmm5, %xmm5
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movss .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movss .L__real_nan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %eax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %eax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %eax
+ movd %eax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: .quad 0x04000000040000000 # 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+.L__real_neg_qnan: .quad 0x0ffc00000ffc00000
+ .quad 0x0ffc00000ffc00000
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+
+.L__mask_mant_all7: .quad 0x00000000007f0000
+ .quad 0x00000000007f0000
+.L__mask_mant8: .quad 0x0000000000008000
+ .quad 0x0000000000008000
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+.L__real_log10_e_lead: .quad 0x3EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x3EDE00003EDE0000
+.L__real_log10_e_tail: .quad 0x3A37B1523A37B152 # log10e_tail 0.0007007319
+ .quad 0x3A37B1523A37B152
+
+.L__real_log10_2_lead: .quad 0x3e9a00003e9a0000
+ .quad 0x0000000000000000
+.L__real_log10_2_tail: .quad 0x39826a1339826a13
+ .quad 0x0000000000000000
+.L__real_log10_e: .quad 0x3ede5bd93ede5bd9
+ .quad 0x0000000000000000
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000
+ .quad 0x0ffff0000ffff0000
+
+.align 16
+
+.L__real_neg127: .long 0x0c2fe0000
+ .long 0
+ .quad 0
+
+.L__mask_253: .long 0x000000fd
+ .long 0
+ .quad 0
+
+.L__real_threshold: .long 0x3d800000
+ .long 0
+ .quad 0
+
+.L__mask_01: .long 0x00000001
+ .long 0
+ .quad 0
+
+.L__mask_80: .long 0x00000080
+ .long 0
+ .quad 0
+
+.L__real_3b800000: .long 0x3b800000
+ .long 0
+ .quad 0
+
+.L__real_1_over_3: .long 0x3eaaaaab
+ .long 0
+ .quad 0
+
+.L__real_1_over_2: .long 0x3f000000
+ .long 0
+ .quad 0
+
+.align 16
+.L__log_128_lead:
+ .long 0x00000000
+ .long 0x3b5d4000
+ .long 0x3bdc8000
+ .long 0x3c24c000
+ .long 0x3c5ac000
+ .long 0x3c884000
+ .long 0x3ca2c000
+ .long 0x3cbd4000
+ .long 0x3cd78000
+ .long 0x3cf1c000
+ .long 0x3d05c000
+ .long 0x3d128000
+ .long 0x3d1f4000
+ .long 0x3d2c0000
+ .long 0x3d388000
+ .long 0x3d450000
+ .long 0x3d518000
+ .long 0x3d5dc000
+ .long 0x3d6a0000
+ .long 0x3d760000
+ .long 0x3d810000
+ .long 0x3d870000
+ .long 0x3d8d0000
+ .long 0x3d92c000
+ .long 0x3d98c000
+ .long 0x3d9e8000
+ .long 0x3da44000
+ .long 0x3daa0000
+ .long 0x3dafc000
+ .long 0x3db58000
+ .long 0x3dbb4000
+ .long 0x3dc0c000
+ .long 0x3dc64000
+ .long 0x3dcc0000
+ .long 0x3dd18000
+ .long 0x3dd6c000
+ .long 0x3ddc4000
+ .long 0x3de1c000
+ .long 0x3de70000
+ .long 0x3dec8000
+ .long 0x3df1c000
+ .long 0x3df70000
+ .long 0x3dfc4000
+ .long 0x3e00c000
+ .long 0x3e034000
+ .long 0x3e05c000
+ .long 0x3e088000
+ .long 0x3e0b0000
+ .long 0x3e0d8000
+ .long 0x3e100000
+ .long 0x3e128000
+ .long 0x3e150000
+ .long 0x3e178000
+ .long 0x3e1a0000
+ .long 0x3e1c8000
+ .long 0x3e1ec000
+ .long 0x3e214000
+ .long 0x3e23c000
+ .long 0x3e260000
+ .long 0x3e288000
+ .long 0x3e2ac000
+ .long 0x3e2d4000
+ .long 0x3e2f8000
+ .long 0x3e31c000
+ .long 0x3e344000
+ .long 0x3e368000
+ .long 0x3e38c000
+ .long 0x3e3b0000
+ .long 0x3e3d4000
+ .long 0x3e3fc000
+ .long 0x3e420000
+ .long 0x3e440000
+ .long 0x3e464000
+ .long 0x3e488000
+ .long 0x3e4ac000
+ .long 0x3e4d0000
+ .long 0x3e4f4000
+ .long 0x3e514000
+ .long 0x3e538000
+ .long 0x3e55c000
+ .long 0x3e57c000
+ .long 0x3e5a0000
+ .long 0x3e5c0000
+ .long 0x3e5e4000
+ .long 0x3e604000
+ .long 0x3e624000
+ .long 0x3e648000
+ .long 0x3e668000
+ .long 0x3e688000
+ .long 0x3e6ac000
+ .long 0x3e6cc000
+ .long 0x3e6ec000
+ .long 0x3e70c000
+ .long 0x3e72c000
+ .long 0x3e74c000
+ .long 0x3e76c000
+ .long 0x3e78c000
+ .long 0x3e7ac000
+ .long 0x3e7cc000
+ .long 0x3e7ec000
+ .long 0x3e804000
+ .long 0x3e814000
+ .long 0x3e824000
+ .long 0x3e834000
+ .long 0x3e840000
+ .long 0x3e850000
+ .long 0x3e860000
+ .long 0x3e870000
+ .long 0x3e880000
+ .long 0x3e88c000
+ .long 0x3e89c000
+ .long 0x3e8ac000
+ .long 0x3e8bc000
+ .long 0x3e8c8000
+ .long 0x3e8d8000
+ .long 0x3e8e8000
+ .long 0x3e8f4000
+ .long 0x3e904000
+ .long 0x3e914000
+ .long 0x3e920000
+ .long 0x3e930000
+ .long 0x3e93c000
+ .long 0x3e94c000
+ .long 0x3e958000
+ .long 0x3e968000
+ .long 0x3e978000
+ .long 0x3e984000
+ .long 0x3e994000
+ .long 0x3e9a0000
+
+.align 16
+.L__log_128_tail:
+ .long 0x00000000
+ .long 0x367a8e44
+ .long 0x368ed49f
+ .long 0x36c21451
+ .long 0x375211d6
+ .long 0x3720ea11
+ .long 0x37e9eb59
+ .long 0x37b87be7
+ .long 0x37bf2560
+ .long 0x33d597a0
+ .long 0x37806a05
+ .long 0x3820581f
+ .long 0x38223334
+ .long 0x378e3bac
+ .long 0x3810684f
+ .long 0x37feb7ae
+ .long 0x36a9d609
+ .long 0x37a68163
+ .long 0x376a8b27
+ .long 0x384c8fd6
+ .long 0x3885183e
+ .long 0x3874a760
+ .long 0x380d1154
+ .long 0x38ea42bd
+ .long 0x384c1571
+ .long 0x38ba66b8
+ .long 0x38e7da3b
+ .long 0x38eee632
+ .long 0x38d00911
+ .long 0x388bbede
+ .long 0x378a0512
+ .long 0x3894c7a0
+ .long 0x38e30710
+ .long 0x36db2829
+ .long 0x3729d609
+ .long 0x38fa0e82
+ .long 0x38bc9a75
+ .long 0x383a9297
+ .long 0x38dc83c8
+ .long 0x37eac335
+ .long 0x38706ac3
+ .long 0x389574c2
+ .long 0x3892d068
+ .long 0x38615032
+ .long 0x3917acf4
+ .long 0x3967a126
+ .long 0x38217840
+ .long 0x38b420ab
+ .long 0x38f9c7b2
+ .long 0x391103bd
+ .long 0x39169a6b
+ .long 0x390dd194
+ .long 0x38eda471
+ .long 0x38a38950
+ .long 0x37f6844a
+ .long 0x395e1cdb
+ .long 0x390fcffc
+ .long 0x38503e9d
+ .long 0x394b00fd
+ .long 0x38a9910a
+ .long 0x39518a31
+ .long 0x3882d2c2
+ .long 0x392488e4
+ .long 0x397b0aff
+ .long 0x388a22d8
+ .long 0x3902bd5e
+ .long 0x39342f85
+ .long 0x39598811
+ .long 0x3972e6b1
+ .long 0x34d53654
+ .long 0x360ca25e
+ .long 0x39785cc0
+ .long 0x39630710
+ .long 0x39424ed7
+ .long 0x39165101
+ .long 0x38be5421
+ .long 0x37e7b0c0
+ .long 0x394fd0c3
+ .long 0x38efaaaa
+ .long 0x37a8f566
+ .long 0x3927c744
+ .long 0x383fa4d5
+ .long 0x392d9e39
+ .long 0x3803feae
+ .long 0x390a268c
+ .long 0x39692b80
+ .long 0x38789b4f
+ .long 0x3909307d
+ .long 0x394a601c
+ .long 0x35e67edc
+ .long 0x383e386d
+ .long 0x38a7743d
+ .long 0x38dccec3
+ .long 0x38ff57e0
+ .long 0x39079d8b
+ .long 0x390651a6
+ .long 0x38f7bad9
+ .long 0x38d0ab82
+ .long 0x38979e7d
+ .long 0x381978ee
+ .long 0x397816c8
+ .long 0x39410cb2
+ .long 0x39015384
+ .long 0x3863fa28
+ .long 0x39f41065
+ .long 0x39c7668a
+ .long 0x39968afa
+ .long 0x39430db9
+ .long 0x38a18cf3
+ .long 0x39eb2907
+ .long 0x39a9e10c
+ .long 0x39492800
+ .long 0x385a53d1
+ .long 0x39ce0cf7
+ .long 0x3979c7b2
+ .long 0x389f5d99
+ .long 0x39ceefcb
+ .long 0x39646a39
+ .long 0x380d7a9b
+ .long 0x39ad6650
+ .long 0x390ac3b8
+ .long 0x39d9a9a8
+ .long 0x39548a99
+ .long 0x39f73c4b
+ .long 0x3980960e
+ .long 0x374b3d5a
+ .long 0x39888f1e
+ .long 0x37679a07
+ .long 0x39826a13
+
+.align 16
+.L__log_F_inv:
+ .long 0x40000000
+ .long 0x3ffe03f8
+ .long 0x3ffc0fc1
+ .long 0x3ffa232d
+ .long 0x3ff83e10
+ .long 0x3ff6603e
+ .long 0x3ff4898d
+ .long 0x3ff2b9d6
+ .long 0x3ff0f0f1
+ .long 0x3fef2eb7
+ .long 0x3fed7304
+ .long 0x3febbdb3
+ .long 0x3fea0ea1
+ .long 0x3fe865ac
+ .long 0x3fe6c2b4
+ .long 0x3fe52598
+ .long 0x3fe38e39
+ .long 0x3fe1fc78
+ .long 0x3fe07038
+ .long 0x3fdee95c
+ .long 0x3fdd67c9
+ .long 0x3fdbeb62
+ .long 0x3fda740e
+ .long 0x3fd901b2
+ .long 0x3fd79436
+ .long 0x3fd62b81
+ .long 0x3fd4c77b
+ .long 0x3fd3680d
+ .long 0x3fd20d21
+ .long 0x3fd0b6a0
+ .long 0x3fcf6475
+ .long 0x3fce168a
+ .long 0x3fcccccd
+ .long 0x3fcb8728
+ .long 0x3fca4588
+ .long 0x3fc907da
+ .long 0x3fc7ce0c
+ .long 0x3fc6980c
+ .long 0x3fc565c8
+ .long 0x3fc43730
+ .long 0x3fc30c31
+ .long 0x3fc1e4bc
+ .long 0x3fc0c0c1
+ .long 0x3fbfa030
+ .long 0x3fbe82fa
+ .long 0x3fbd6910
+ .long 0x3fbc5264
+ .long 0x3fbb3ee7
+ .long 0x3fba2e8c
+ .long 0x3fb92144
+ .long 0x3fb81703
+ .long 0x3fb70fbb
+ .long 0x3fb60b61
+ .long 0x3fb509e7
+ .long 0x3fb40b41
+ .long 0x3fb30f63
+ .long 0x3fb21643
+ .long 0x3fb11fd4
+ .long 0x3fb02c0b
+ .long 0x3faf3ade
+ .long 0x3fae4c41
+ .long 0x3fad602b
+ .long 0x3fac7692
+ .long 0x3fab8f6a
+ .long 0x3faaaaab
+ .long 0x3fa9c84a
+ .long 0x3fa8e83f
+ .long 0x3fa80a81
+ .long 0x3fa72f05
+ .long 0x3fa655c4
+ .long 0x3fa57eb5
+ .long 0x3fa4a9cf
+ .long 0x3fa3d70a
+ .long 0x3fa3065e
+ .long 0x3fa237c3
+ .long 0x3fa16b31
+ .long 0x3fa0a0a1
+ .long 0x3f9fd80a
+ .long 0x3f9f1166
+ .long 0x3f9e4cad
+ .long 0x3f9d89d9
+ .long 0x3f9cc8e1
+ .long 0x3f9c09c1
+ .long 0x3f9b4c70
+ .long 0x3f9a90e8
+ .long 0x3f99d723
+ .long 0x3f991f1a
+ .long 0x3f9868c8
+ .long 0x3f97b426
+ .long 0x3f97012e
+ .long 0x3f964fda
+ .long 0x3f95a025
+ .long 0x3f94f209
+ .long 0x3f944581
+ .long 0x3f939a86
+ .long 0x3f92f114
+ .long 0x3f924925
+ .long 0x3f91a2b4
+ .long 0x3f90fdbc
+ .long 0x3f905a38
+ .long 0x3f8fb824
+ .long 0x3f8f177a
+ .long 0x3f8e7835
+ .long 0x3f8dda52
+ .long 0x3f8d3dcb
+ .long 0x3f8ca29c
+ .long 0x3f8c08c1
+ .long 0x3f8b7034
+ .long 0x3f8ad8f3
+ .long 0x3f8a42f8
+ .long 0x3f89ae41
+ .long 0x3f891ac7
+ .long 0x3f888889
+ .long 0x3f87f781
+ .long 0x3f8767ab
+ .long 0x3f86d905
+ .long 0x3f864b8a
+ .long 0x3f85bf37
+ .long 0x3f853408
+ .long 0x3f84a9fa
+ .long 0x3f842108
+ .long 0x3f839930
+ .long 0x3f83126f
+ .long 0x3f828cc0
+ .long 0x3f820821
+ .long 0x3f81848e
+ .long 0x3f810204
+ .long 0x3f808081
+ .long 0x3f800000
+
+
diff --git a/src/gas/log2.S b/src/gas/log2.S
new file mode 100644
index 0000000..0c791b5
--- /dev/null
+++ b/src/gas/log2.S
@@ -0,0 +1,1132 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log2.S
+#
+# An implementation of the log2 libm function.
+#
+# Prototype:
+#
+# double log2(double x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
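+# Same table-based reduction as in log10.S; the reconstruction here is
+#   log2(x) ~ m + log2(F) - (r + r^2/2 + ... + r^6/6)*log2(e)
+# with log2(F) from .L__log_256_lead/_tail and log2(e) split into lead/tail
+# parts on the near-one codepath.
+#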
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log2)
+#define fname_special _log2_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %rax, %rax
+ movdqa %xmm0, %xmm3
+ movsd %xmm0, %xmm4
+ psrlq $52, %xmm3
+ movd %xmm0, %rax
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm5
+ andpd .L__real_inf(%rip), %xmm5
+ comisd .L__real_inf(%rip), %xmm5
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorpd %xmm5, %xmm5
+ comisd %xmm5, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm4
+
+ comisd .L__mask_1023_f(%rip), %xmm6
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %rax, %r9
+ and .L__mask_mant_all8(%rip), %rax
+ and .L__mask_mant9(%rip), %r9
+ shl $1, %r9
+ add %r9, %rax
+ mov %rax, p_temp(%rsp)
+
+ # near one codepath
+ andpd .L__real_notsign(%rip), %xmm4
+ comisd .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movsd p_temp(%rsp), %xmm1
+ shr $44, %rax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ mulsd (%r9,%rax,8), %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_6(%rip), %xmm3
+ movsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_5(%rip), %xmm3
+ addsd .L__real_1_over_2(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_over_4(%rip), %xmm3
+ addsd %xmm2, %xmm1
+ mulsd %xmm4, %xmm3
+ addsd %xmm3, %xmm1
+
+ mulsd .L__real_log2_e(%rip), %xmm1
+
+ # m + log2(G) - poly*log2_e
+ movsd (%r9,%rax,8), %xmm0
+ lea .L__log_256_tail(%rip), %rdx
+ movsd (%rdx,%rax,8), %xmm2
+ subsd %xmm1, %xmm2
+
+ addsd %xmm6, %xmm0
+ addsd %xmm2, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
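+ # same 2*atanh(u/2)-style series as the log10.S near-one path, with the
+ # final scaling by log2(e) split into lead and tail parts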
+
+ # r = x - 1.0
+ movsd .L__real_two(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm0 # r
+
+ addsd %xmm0, %xmm2
+ movsd %xmm0, %xmm1
+ divsd %xmm2, %xmm1 # r/(2+r) = u/2
+
+ movsd .L__real_ca2(%rip), %xmm4
+ movsd .L__real_ca4(%rip), %xmm5
+
+ movsd %xmm0, %xmm6
+ mulsd %xmm1, %xmm6 # correction
+
+ addsd %xmm1, %xmm1 # u
+ movsd %xmm1, %xmm2
+
+ mulsd %xmm1, %xmm2 # u^2
+
+ mulsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm5
+
+ addsd .L__real_ca1(%rip), %xmm4
+ addsd .L__real_ca3(%rip), %xmm5
+
+ mulsd %xmm1, %xmm2 # u^3
+ mulsd %xmm2, %xmm4
+
+ mulsd %xmm2, %xmm2
+ mulsd %xmm1, %xmm2 # u^7
+ mulsd %xmm2, %xmm5
+
+ addsd %xmm5, %xmm4
+ subsd %xmm6, %xmm4
+
+ movdqa %xmm0, %xmm3
+ pand .L__mask_lower(%rip), %xmm3
+ subsd %xmm3, %xmm0
+ addsd %xmm0, %xmm4
+
+ movsd %xmm3, %xmm0
+ movsd %xmm4, %xmm1
+
+ mulsd .L__real_log2_e_tail(%rip), %xmm4
+ mulsd .L__real_log2_e_tail(%rip), %xmm0
+ mulsd .L__real_log2_e_lead(%rip), %xmm1
+ mulsd .L__real_log2_e_lead(%rip), %xmm3
+
+ addsd %xmm4, %xmm0
+ addsd %xmm1, %xmm0
+ addsd %xmm3, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm2
+ movsd %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %rax
+ psrlq $52, %xmm5
+ psubd .L__mask_2045(%rip), %xmm5
+ cvtdq2pd %xmm5, %xmm6
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movsd .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movsd .L__real_qnan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %rax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %rax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9
+ and %rax, %r9
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %rax
+ movd %rax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_qnan: .quad 0x7ff8000000000000 # qNaN
+ .quad 0x0000000000000000
+.L__real_qnanbit: .quad 0x0008000000000000
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+.L__mask_001: .quad 0x0000000000000001
+ .quad 0x0000000000000000
+
+.L__mask_mant_all8: .quad 0x000ff00000000000
+ .quad 0x0000000000000000
+.L__mask_mant9: .quad 0x0000080000000000
+ .quad 0x0000000000000000
+
+.L__real_log2_e: .quad 0x3ff71547652b82fe
+ .quad 0x0000000000000000
+
+.L__real_log2_e_lead: .quad 0x3ff7154400000000 # log2e_lead 1.44269180297851562500E+00
+ .quad 0x0000000000000000
+.L__real_log2_e_tail: .quad 0x3ecb295c17f0bbbe # log2e_tail 3.23791044778235969970E-06
+ .quad 0x0000000000000000
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__mask_100: .quad 0x0000000000000100
+ .quad 0x0000000000000000
+
+.L__real_1_over_512: .quad 0x3f60000000000000
+ .quad 0x0000000000000000
+
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_5: .quad 0x3fc999999999999a
+ .quad 0x0000000000000000
+.L__real_1_over_6: .quad 0x3fc5555555555555
+ .quad 0x0000000000000000
+
+.L__mask_1023_f: .quad 0x0c08ff80000000000
+ .quad 0x0000000000000000
+
+.L__mask_2045: .quad 0x00000000000007fd
+ .quad 0x0000000000000000
+
+.L__real_threshold: .quad 0x3fb0000000000000 # .0625
+ .quad 0x0000000000000000
+
+.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x0000000000000000
+
+.L__real_ca1: .quad 0x3fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x0000000000000000
+.L__real_ca2: .quad 0x3f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x0000000000000000
+.L__real_ca3: .quad 0x3f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x0000000000000000
+.L__real_ca4: .quad 0x3f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x0000000000000000
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_lead:
+ .quad 0x0000000000000000
+ .quad 0x3f7709c460000000
+ .quad 0x3f86fe50b0000000
+ .quad 0x3f91363110000000
+ .quad 0x3f96e79680000000
+ .quad 0x3f9c9363b0000000
+ .quad 0x3fa11cd1d0000000
+ .quad 0x3fa3ed3090000000
+ .quad 0x3fa6bad370000000
+ .quad 0x3fa985bfc0000000
+ .quad 0x3fac4dfab0000000
+ .quad 0x3faf138980000000
+ .quad 0x3fb0eb3890000000
+ .quad 0x3fb24b5b70000000
+ .quad 0x3fb3aa2fd0000000
+ .quad 0x3fb507b830000000
+ .quad 0x3fb663f6f0000000
+ .quad 0x3fb7beee90000000
+ .quad 0x3fb918a160000000
+ .quad 0x3fba7111d0000000
+ .quad 0x3fbbc84240000000
+ .quad 0x3fbd1e34e0000000
+ .quad 0x3fbe72ec10000000
+ .quad 0x3fbfc66a00000000
+ .quad 0x3fc08c5880000000
+ .quad 0x3fc134e1b0000000
+ .quad 0x3fc1dcd190000000
+ .quad 0x3fc2842940000000
+ .quad 0x3fc32ae9e0000000
+ .quad 0x3fc3d11460000000
+ .quad 0x3fc476a9f0000000
+ .quad 0x3fc51bab90000000
+ .quad 0x3fc5c01a30000000
+ .quad 0x3fc663f6f0000000
+ .quad 0x3fc70742d0000000
+ .quad 0x3fc7a9fec0000000
+ .quad 0x3fc84c2bd0000000
+ .quad 0x3fc8edcae0000000
+ .quad 0x3fc98edd00000000
+ .quad 0x3fca2f6320000000
+ .quad 0x3fcacf5e20000000
+ .quad 0x3fcb6ecf10000000
+ .quad 0x3fcc0db6c0000000
+ .quad 0x3fccac1630000000
+ .quad 0x3fcd49ee40000000
+ .quad 0x3fcde73fe0000000
+ .quad 0x3fce840be0000000
+ .quad 0x3fcf205330000000
+ .quad 0x3fcfbc16b0000000
+ .quad 0x3fd02baba0000000
+ .quad 0x3fd0790ad0000000
+ .quad 0x3fd0c62970000000
+ .quad 0x3fd11307d0000000
+ .quad 0x3fd15fa670000000
+ .quad 0x3fd1ac05b0000000
+ .quad 0x3fd1f825f0000000
+ .quad 0x3fd24407a0000000
+ .quad 0x3fd28fab30000000
+ .quad 0x3fd2db10f0000000
+ .quad 0x3fd3263960000000
+ .quad 0x3fd37124c0000000
+ .quad 0x3fd3bbd3a0000000
+ .quad 0x3fd4064630000000
+ .quad 0x3fd4507cf0000000
+ .quad 0x3fd49a7840000000
+ .quad 0x3fd4e43880000000
+ .quad 0x3fd52dbdf0000000
+ .quad 0x3fd5770910000000
+ .quad 0x3fd5c01a30000000
+ .quad 0x3fd608f1b0000000
+ .quad 0x3fd6518fe0000000
+ .quad 0x3fd699f520000000
+ .quad 0x3fd6e221c0000000
+ .quad 0x3fd72a1630000000
+ .quad 0x3fd771d2b0000000
+ .quad 0x3fd7b957a0000000
+ .quad 0x3fd800a560000000
+ .quad 0x3fd847bc30000000
+ .quad 0x3fd88e9c70000000
+ .quad 0x3fd8d54670000000
+ .quad 0x3fd91bba80000000
+ .quad 0x3fd961f900000000
+ .quad 0x3fd9a80230000000
+ .quad 0x3fd9edd670000000
+ .quad 0x3fda337600000000
+ .quad 0x3fda78e140000000
+ .quad 0x3fdabe1870000000
+ .quad 0x3fdb031be0000000
+ .quad 0x3fdb47ebf0000000
+ .quad 0x3fdb8c88d0000000
+ .quad 0x3fdbd0f2e0000000
+ .quad 0x3fdc152a60000000
+ .quad 0x3fdc592fa0000000
+ .quad 0x3fdc9d02f0000000
+ .quad 0x3fdce0a490000000
+ .quad 0x3fdd2414c0000000
+ .quad 0x3fdd6753e0000000
+ .quad 0x3fddaa6220000000
+ .quad 0x3fdded3fd0000000
+ .quad 0x3fde2fed30000000
+ .quad 0x3fde726aa0000000
+ .quad 0x3fdeb4b840000000
+ .quad 0x3fdef6d670000000
+ .quad 0x3fdf38c560000000
+ .quad 0x3fdf7a8560000000
+ .quad 0x3fdfbc16b0000000
+ .quad 0x3fdffd7990000000
+ .quad 0x3fe01f5720000000
+ .quad 0x3fe03fda80000000
+ .quad 0x3fe0604710000000
+ .quad 0x3fe0809cf0000000
+ .quad 0x3fe0a0dc30000000
+ .quad 0x3fe0c10500000000
+ .quad 0x3fe0e11770000000
+ .quad 0x3fe10113b0000000
+ .quad 0x3fe120f9d0000000
+ .quad 0x3fe140c9f0000000
+ .quad 0x3fe1608440000000
+ .quad 0x3fe18028c0000000
+ .quad 0x3fe19fb7b0000000
+ .quad 0x3fe1bf3110000000
+ .quad 0x3fe1de9510000000
+ .quad 0x3fe1fde3d0000000
+ .quad 0x3fe21d1d50000000
+ .quad 0x3fe23c41d0000000
+ .quad 0x3fe25b5150000000
+ .quad 0x3fe27a4c00000000
+ .quad 0x3fe29931f0000000
+ .quad 0x3fe2b80340000000
+ .quad 0x3fe2d6c010000000
+ .quad 0x3fe2f56870000000
+ .quad 0x3fe313fc80000000
+ .quad 0x3fe3327c60000000
+ .quad 0x3fe350e830000000
+ .quad 0x3fe36f3ff0000000
+ .quad 0x3fe38d83e0000000
+ .quad 0x3fe3abb3f0000000
+ .quad 0x3fe3c9d060000000
+ .quad 0x3fe3e7d930000000
+ .quad 0x3fe405ce80000000
+ .quad 0x3fe423b070000000
+ .quad 0x3fe4417f20000000
+ .quad 0x3fe45f3a90000000
+ .quad 0x3fe47ce2f0000000
+ .quad 0x3fe49a7840000000
+ .quad 0x3fe4b7fab0000000
+ .quad 0x3fe4d56a50000000
+ .quad 0x3fe4f2c740000000
+ .quad 0x3fe5101180000000
+ .quad 0x3fe52d4940000000
+ .quad 0x3fe54a6e80000000
+ .quad 0x3fe5678170000000
+ .quad 0x3fe5848220000000
+ .quad 0x3fe5a170a0000000
+ .quad 0x3fe5be4d00000000
+ .quad 0x3fe5db1770000000
+ .quad 0x3fe5f7cff0000000
+ .quad 0x3fe61476a0000000
+ .quad 0x3fe6310b80000000
+ .quad 0x3fe64d8ed0000000
+ .quad 0x3fe66a0080000000
+ .quad 0x3fe68660c0000000
+ .quad 0x3fe6a2af90000000
+ .quad 0x3fe6beed20000000
+ .quad 0x3fe6db1960000000
+ .quad 0x3fe6f73480000000
+ .quad 0x3fe7133e90000000
+ .quad 0x3fe72f37a0000000
+ .quad 0x3fe74b1fd0000000
+ .quad 0x3fe766f720000000
+ .quad 0x3fe782bdb0000000
+ .quad 0x3fe79e73a0000000
+ .quad 0x3fe7ba18f0000000
+ .quad 0x3fe7d5adc0000000
+ .quad 0x3fe7f13220000000
+ .quad 0x3fe80ca620000000
+ .quad 0x3fe82809d0000000
+ .quad 0x3fe8435d50000000
+ .quad 0x3fe85ea0b0000000
+ .quad 0x3fe879d3f0000000
+ .quad 0x3fe894f740000000
+ .quad 0x3fe8b00aa0000000
+ .quad 0x3fe8cb0e30000000
+ .quad 0x3fe8e60200000000
+ .quad 0x3fe900e610000000
+ .quad 0x3fe91bba80000000
+ .quad 0x3fe9367f60000000
+ .quad 0x3fe95134d0000000
+ .quad 0x3fe96bdad0000000
+ .quad 0x3fe9867170000000
+ .quad 0x3fe9a0f8d0000000
+ .quad 0x3fe9bb70f0000000
+ .quad 0x3fe9d5d9f0000000
+ .quad 0x3fe9f033e0000000
+ .quad 0x3fea0a7ed0000000
+ .quad 0x3fea24bad0000000
+ .quad 0x3fea3ee7f0000000
+ .quad 0x3fea590640000000
+ .quad 0x3fea7315d0000000
+ .quad 0x3fea8d16b0000000
+ .quad 0x3feaa708f0000000
+ .quad 0x3feac0eca0000000
+ .quad 0x3feadac1e0000000
+ .quad 0x3feaf488b0000000
+ .quad 0x3feb0e4120000000
+ .quad 0x3feb27eb40000000
+ .quad 0x3feb418730000000
+ .quad 0x3feb5b14f0000000
+ .quad 0x3feb749480000000
+ .quad 0x3feb8e0620000000
+ .quad 0x3feba769b0000000
+ .quad 0x3febc0bf50000000
+ .quad 0x3febda0710000000
+ .quad 0x3febf34110000000
+ .quad 0x3fec0c6d40000000
+ .quad 0x3fec258bc0000000
+ .quad 0x3fec3e9ca0000000
+ .quad 0x3fec579fe0000000
+ .quad 0x3fec7095a0000000
+ .quad 0x3fec897df0000000
+ .quad 0x3feca258d0000000
+ .quad 0x3fecbb2660000000
+ .quad 0x3fecd3e6a0000000
+ .quad 0x3fecec9990000000
+ .quad 0x3fed053f60000000
+ .quad 0x3fed1dd810000000
+ .quad 0x3fed3663b0000000
+ .quad 0x3fed4ee240000000
+ .quad 0x3fed6753e0000000
+ .quad 0x3fed7fb890000000
+ .quad 0x3fed981060000000
+ .quad 0x3fedb05b60000000
+ .quad 0x3fedc899a0000000
+ .quad 0x3fede0cb30000000
+ .quad 0x3fedf8f020000000
+ .quad 0x3fee110860000000
+ .quad 0x3fee291420000000
+ .quad 0x3fee411360000000
+ .quad 0x3fee590630000000
+ .quad 0x3fee70eca0000000
+ .quad 0x3fee88c6b0000000
+ .quad 0x3feea09470000000
+ .quad 0x3feeb855f0000000
+ .quad 0x3feed00b40000000
+ .quad 0x3feee7b470000000
+ .quad 0x3feeff5180000000
+ .quad 0x3fef16e280000000
+ .quad 0x3fef2e6780000000
+ .quad 0x3fef45e080000000
+ .quad 0x3fef5d4da0000000
+ .quad 0x3fef74aef0000000
+ .quad 0x3fef8c0460000000
+ .quad 0x3fefa34e10000000
+ .quad 0x3fefba8c00000000
+ .quad 0x3fefd1be40000000
+ .quad 0x3fefe8e4f0000000
+ .quad 0x3ff0000000000000
+
+.align 16
+.L__log_256_tail:
+ .quad 0x0000000000000000
+ .quad 0x3deaf558ee95b37a
+ .quad 0x3debbc2145fe38de
+ .quad 0x3dfea5ec312ed069
+ .quad 0x3df70b48a629b89f
+ .quad 0x3e050a1f0cccdd01
+ .quad 0x3e044cd04bb60514
+ .quad 0x3e01a16898809d2d
+ .quad 0x3e063bf61cc4d81b
+ .quad 0x3dfa4a8ca305071d
+ .quad 0x3e121556bde9f1f0
+ .quad 0x3df9929cfd0e6835
+ .quad 0x3e2f453f35679ee9
+ .quad 0x3e2c26b47913459e
+ .quad 0x3e2a4fe385b009a2
+ .quad 0x3e180ceedb53cb4d
+ .quad 0x3e2592262cf998a7
+ .quad 0x3e1ae28a04f106b8
+ .quad 0x3e2c8c66b55ce464
+ .quad 0x3e2e690927d688b0
+ .quad 0x3de5b5774c7658b4
+ .quad 0x3e0adc16d26859c7
+ .quad 0x3df7fa5b21cbdb5d
+ .quad 0x3e2e160149209a68
+ .quad 0x3e39b4f3c72c4f78
+ .quad 0x3e222418b7fcd690
+ .quad 0x3e2d54aded7a9150
+ .quad 0x3e360f4c7f1aed15
+ .quad 0x3e13c570d0fa8f96
+ .quad 0x3e3b3514c7e0166e
+ .quad 0x3e3307ee9a6271d2
+ .quad 0x3dee9722922c0226
+ .quad 0x3e33f7ad0f3f4016
+ .quad 0x3e3592262cf998a7
+ .quad 0x3e23bc09fca70073
+ .quad 0x3e2f41777bc5f936
+ .quad 0x3dd781d97ee91247
+ .quad 0x3e306a56d76b9a84
+ .quad 0x3e2df9c37c0beb3a
+ .quad 0x3e1905c35651c429
+ .quad 0x3e3b69d927dfc23d
+ .quad 0x3e2d7e57a5afb633
+ .quad 0x3e3bb29bdc81c4db
+ .quad 0x3e38ee1b912d8994
+ .quad 0x3e3864b2df91e96a
+ .quad 0x3e1d8a40770df213
+ .quad 0x3e2d39a9331f27cf
+ .quad 0x3e32411e4e8eea54
+ .quad 0x3e3204d0144751b3
+ .quad 0x3e2268331dd8bd0b
+ .quad 0x3e47606012de0634
+ .quad 0x3e3550aa3a93ec7e
+ .quad 0x3e45a616eb9612e0
+ .quad 0x3e3aec23fd65f8e1
+ .quad 0x3e248f838294639c
+ .quad 0x3e3b62384cafa1a3
+ .quad 0x3e461c0e73048b72
+ .quad 0x3e36cc9a0d8c0e85
+ .quad 0x3e489b355ede26f4
+ .quad 0x3e2b5941acd71f1e
+ .quad 0x3e4d499bd9b32266
+ .quad 0x3e043b9f52b061ba
+ .quad 0x3e46360892eb65e6
+ .quad 0x3e4dba9f8729ab41
+ .quad 0x3e479a3715fc9257
+ .quad 0x3e0d1f6d3f77ae38
+ .quad 0x3e48992d66fb9ec1
+ .quad 0x3e4666f195620f03
+ .quad 0x3e43f7ad0f3f4016
+ .quad 0x3e30a522b65bc039
+ .quad 0x3e319dee9b9489e3
+ .quad 0x3e323352e1a31521
+ .quad 0x3e4b3a19bcaf1aa4
+ .quad 0x3e3f2f060a50d366
+ .quad 0x3e44fdf677c8dfd9
+ .quad 0x3e48a35588aec6df
+ .quad 0x3e28b0e2a19575b0
+ .quad 0x3e2ec30c6e3e04a7
+ .quad 0x3e2705912d25b325
+ .quad 0x3e2dae1b8d59e849
+ .quad 0x3e423e2e1169656a
+ .quad 0x3e349d026e33d675
+ .quad 0x3e423c465e6976da
+ .quad 0x3e366c977e236c73
+ .quad 0x3e44fec0a13af881
+ .quad 0x3e3bdefbd14a0816
+ .quad 0x3e42fe3e91c348e4
+ .quad 0x3e4fc0c868ccc02d
+ .quad 0x3e3ce20a829051bb
+ .quad 0x3e47f10cf32e6bba
+ .quad 0x3e43cf2061568859
+ .quad 0x3e484995cb804b94
+ .quad 0x3e4a52b6acfcfdca
+ .quad 0x3e3b291ecf4dff1e
+ .quad 0x3e21d2c3e64ae851
+ .quad 0x3e4017e4faa42b7d
+ .quad 0x3de975077f1f5f0c
+ .quad 0x3e20327dc8093a52
+ .quad 0x3e3108d9313aec65
+ .quad 0x3e4a12e5301be44a
+ .quad 0x3e1e754d20c519e1
+ .quad 0x3e3f456f394f9727
+ .quad 0x3e29471103e8f00d
+ .quad 0x3e3ef3150343f8df
+ .quad 0x3e41960d9d9c3263
+ .quad 0x3e4204d0144751b3
+ .quad 0x3e4507ff357398fe
+ .quad 0x3e4dc9937fc8cafd
+ .quad 0x3e572f32fe672868
+ .quad 0x3e53e49d647d323e
+ .quad 0x3e33fb81ea92d9e0
+ .quad 0x3e43e387ef003635
+ .quad 0x3e1ac754cb104aea
+ .quad 0x3e4535f0444ebaaf
+ .quad 0x3e253c8ea7b1cdda
+ .quad 0x3e3cf0c0396a568b
+ .quad 0x3e5543ca873c2b4a
+ .quad 0x3e425780181e2b37
+ .quad 0x3e5ee52ed49d71d2
+ .quad 0x3e51e64842e2c386
+ .quad 0x3e5d2ba01bc76a27
+ .quad 0x3e5b39774c30f499
+ .quad 0x3e38740932120aea
+ .quad 0x3e576dab3462a1e8
+ .quad 0x3e409c9f20203b31
+ .quad 0x3e516e7a08ad0d1a
+ .quad 0x3e46172fe015e13b
+ .quad 0x3e49e4558147cf67
+ .quad 0x3e4cfdeb43cfd005
+ .quad 0x3e3a809c03254a71
+ .quad 0x3e47acfc98509e33
+ .quad 0x3e54366de473e474
+ .quad 0x3e5569394d90d724
+ .quad 0x3e32b83ec743664c
+ .quad 0x3e56db22c4808ee5
+ .quad 0x3df7ae84940df0e1
+ .quad 0x3e554042cd999564
+ .quad 0x3e4242b8488b3056
+ .quad 0x3e4e7dc059ab8a9e
+ .quad 0x3e5a71e977d7da5f
+ .quad 0x3e5d30d552ce0ec3
+ .quad 0x3e43208592b6c6b7
+ .quad 0x3e51440e7149afff
+ .quad 0x3e36812c371a1c87
+ .quad 0x3e579a3715fc9257
+ .quad 0x3e57c92f2af8b0ca
+ .quad 0x3e56679d8894dbdf
+ .quad 0x3e2a9f33e77507f0
+ .quad 0x3e4c22a3e377a524
+ .quad 0x3e3723c84a77a4dc
+ .quad 0x3e594a871b636194
+ .quad 0x3e570d6058f62f4d
+ .quad 0x3e4a6274cf0e362f
+ .quad 0x3e42fe3570af1a0b
+ .quad 0x3e596a286955d67e
+ .quad 0x3e442104f127091e
+ .quad 0x3e407826bae32c6b
+ .quad 0x3df8d8844ce77237
+ .quad 0x3e5eaa609080d4b4
+ .quad 0x3e4dc66fbe61efc4
+ .quad 0x3e5c8f11979a5db6
+ .quad 0x3e52dedf0e6f1770
+ .quad 0x3e5cb41e1410132a
+ .quad 0x3e32866d705c553d
+ .quad 0x3e54ec3293b2fbe0
+ .quad 0x3e578b8c2f4d0fe1
+ .quad 0x3e562ad8f7ca2cff
+ .quad 0x3e5a298b5f973a2c
+ .quad 0x3e49381d4f1b95e0
+ .quad 0x3e564c7bdb9bc56c
+ .quad 0x3e5fbb4caef790fc
+ .quad 0x3e51200c3f899927
+ .quad 0x3e526a05c813d56e
+ .quad 0x3e4681e2910108ee
+ .quad 0x3e282cf15d12ecd7
+ .quad 0x3e0a537e32446892
+ .quad 0x3e46f9c1cb6f7010
+ .quad 0x3e4328ddcedf39d8
+ .quad 0x3e164f64c210df9d
+ .quad 0x3e58f676e17cc811
+ .quad 0x3e560ddf1680dd45
+ .quad 0x3e5e2da951c2d91b
+ .quad 0x3e5696777b66d115
+ .quad 0x3e311eb3043f5601
+ .quad 0x3e48000b33f90fd4
+ .quad 0x3e523e2e1169656a
+ .quad 0x3e5b41565d3990cb
+ .quad 0x3e46138b8d9d31e6
+ .quad 0x3e3565afaf7f6248
+ .quad 0x3e4b68e0ba153594
+ .quad 0x3e3d87027ef4ab9a
+ .quad 0x3e556b9c99085939
+ .quad 0x3e5aa02166cccab2
+ .quad 0x3e5991d2aca399a1
+ .quad 0x3e54982259cc625d
+ .quad 0x3e4b9feddaab9820
+ .quad 0x3e3c70c0f683cc68
+ .quad 0x3e213156425e67e5
+ .quad 0x3df79063deab051f
+ .quad 0x3e27e2744b2b8ca5
+ .quad 0x3e4600534df378df
+ .quad 0x3e59322676507a79
+ .quad 0x3e4c4720cb4558b5
+ .quad 0x3e445e4b56add63a
+ .quad 0x3e4af321af5e9bb5
+ .quad 0x3e57f1e1148dad64
+ .quad 0x3e42a4022f65e2e6
+ .quad 0x3e11f2ccbcd0d3cc
+ .quad 0x3e5eaa65b49696e2
+ .quad 0x3e110e6123a74764
+ .quad 0x3e3cf24b2077c3f6
+ .quad 0x3e4fc8d8164754da
+ .quad 0x3e598cfcdb6a2dbc
+ .quad 0x3e24464a6bcdf47b
+ .quad 0x3e41f1774d8b66a6
+ .quad 0x3e459920a2adf6fa
+ .quad 0x3e370d02a99b4c5a
+ .quad 0x3e576b6cafa2532d
+ .quad 0x3e5d23c38ec17936
+ .quad 0x3e541b6b1b0e66c4
+ .quad 0x3e5952662c6bfdc7
+ .quad 0x3e4333f3d6bb35ec
+ .quad 0x3e195120d8486e92
+ .quad 0x3e5db8a405fac56e
+ .quad 0x3e5a4c112ce6312e
+ .quad 0x3e536987e1924e45
+ .quad 0x3e33f98ea94bc1bd
+ .quad 0x3e459718aacb6ec7
+ .quad 0x3df975077f1f5f0c
+ .quad 0x3e13654d88f20500
+ .quad 0x3e40f598530f101b
+ .quad 0x3e5145f6c94f7fd7
+ .quad 0x3e567fead8bcce75
+ .quad 0x3e52e67148cd0a7b
+ .quad 0x3e10d5e5897de907
+ .quad 0x3e5b5ee92c53d919
+ .quad 0x3e5c1c02803f7554
+ .quad 0x3e5d5caa7a35c9f7
+ .quad 0x3e5910459cac3223
+ .quad 0x3e41fbe1bb98afdf
+ .quad 0x3e3b135395510d1e
+ .quad 0x3e47b8f0e7b8e757
+ .quad 0x3e519511f61a96b8
+ .quad 0x3e5117d846ae1f8e
+ .quad 0x3e2b3a9507d6dc1f
+ .quad 0x3e15fa7c78c9e676
+ .quad 0x3e2db76303b21928
+ .quad 0x3e27eb8450ac22ed
+ .quad 0x3e579e0caa9c9ab7
+ .quad 0x3e59de6d7cba1bbe
+ .quad 0x3e1df5f5baf436cb
+ .quad 0x3e3e746344728dbf
+ .quad 0x3e277c23362928b9
+ .quad 0x3e4715137cfeba9f
+ .quad 0x3e58fe55f2856443
+ .quad 0x3e25bd1a025d9e24
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_F_inv:
+ .quad 0x4000000000000000
+ .quad 0x3fffe01fe01fe020
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fffa11caa01fa12
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff6310aca0dbb5
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff25f644230ab5
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffee9c7f8458e02
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffeae807aba01eb
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe741aa59750e4
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe3a9179dc1a73
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffe01e01e01e01e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdca01dca01dca
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd92f2231e7f8a
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd5cac807572b2
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd272ca3fc5b1a
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcf26e5c44bfc6
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffcbe6d9601cbe7
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc8b265afb8a42
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc5894d10d4986
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc26b5392ea01c
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbf583ee868d8b
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbc4fd65883e7b
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb951e2b18ff23
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb65e2e3beee05
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb37484ad806ce
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffb094b31d922a4
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffadbe87f94905e
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffaaf1d2f87ebfd
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa82e65130e159
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa574107688a4a
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa2c2a87c51ca0
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ffa01a01a01a01a
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9d79f176b682d
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff9ae24ea5510da
+ .quad 0x3ff999999999999a
+ .quad 0x3ff9852f0d8ec0ff
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff95cbb0be377ae
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff934c67f9b2ce6
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff90d4f120190d5
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8e6527af1373f
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8bfce8062ff3a
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff899c0f601899c
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff87427bcc092b9
+ .quad 0x3ff8618618618618
+ .quad 0x3ff84f00c2780614
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff82a4a0182a4a0
+ .quad 0x3ff8181818181818
+ .quad 0x3ff8060180601806
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7e225515a4f1d
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7beb3922e017c
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff79baa6bb6398b
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff77908119ac60d
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff756cac201756d
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff734f0c541fe8d
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff713786d9c7c09
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6f26016f26017
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6d1a62681c861
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6b1490aa31a3d
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff691473a88d0c0
+ .quad 0x3ff6816816816817
+ .quad 0x3ff6719f3601671a
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff6524f853b4aa3
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff63356b88ac0de
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff614b36831ae94
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5f66434292dfc
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5d867c3ece2a5
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5babcc647fa91
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff59d61f123ccaa
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff5805601580560
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff56397ba7c52e2
+ .quad 0x3ff5555555555555
+ .quad 0x3ff54725e6bb82fe
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff52aff56a8054b
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff50f22e111c4c5
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4f38f62dd4c9b
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4d843bedc2c4c
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4bd3edda68fe1
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff4a27fad76014a
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff4880522014880
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff46dce34596066
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff453d9e2c776ca
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff43a2730abee4d
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff420b5265e5951
+ .quad 0x3ff4141414141414
+ .quad 0x3ff40782d10e6566
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3ee8f42a5af07
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3d5d991aa75c6
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3bd60d9232955
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3a524387ac822
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff38d22d366088e
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3755bd1c945ee
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff35dce5f9f2af8
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff34679ace01346
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff32f5ced6a1dfa
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff3187758e9ebb6
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff301c82ac40260
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2eb4ea1fed14b
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2d50a012d50a0
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2bef98e5a3711
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff2a91c92f3c105
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff293725bb804a5
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27dfa38a1ce4d
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff268b37cd60127
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2539d7e9177b2
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23eb79717605b
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff22a0122a0122a
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff21579804855e6
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff2012012012012
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1ecf43c7fb84c
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1d8f5672e4abd
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1c522fc1ce059
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1b17c67f2bae3
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19e0119e0119e
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff18ab083902bdb
+ .quad 0x3ff1811811811812
+ .quad 0x3ff1778a191bd684
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff1648d50fc3201
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff151b9a3fdd5c9
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff13f0e8d344724
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12c8b89edc0ac
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff11a3019a74826
+ .quad 0x3ff1111111111111
+ .quad 0x3ff107fbbe011080
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0f5edfab325a2
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0e40655826011
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0d24456359e3a
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0c0a7868b4171
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0af2f722eecb5
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff09ddba6af8360
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff08cabb37565e2
+ .quad 0x3ff0842108421084
+ .quad 0x3ff07b9f29b8eae2
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff06ab59c7912fb
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff059eea0727586
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff04949cc1664c5
+ .quad 0x3ff0410410410410
+ .quad 0x3ff038c6b78247fc
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff02864fc7729e9
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0182436517a37
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0080402010080
+ .quad 0x3ff0000000000000
+ .quad 0x0000000000000000
+
+
diff --git a/src/gas/log2f.S b/src/gas/log2f.S
new file mode 100644
index 0000000..5361e0f
--- /dev/null
+++ b/src/gas/log2f.S
@@ -0,0 +1,738 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# log2f.S
+#
+# An implementation of the log2f libm function.
+#
+# Prototype:
+#
+# float log2f(float x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
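In this single-precision variant the table entries are already base-2 logarithms; only the small correction term is computed as a natural-log series and then rescaled by log2(e) (the .L__real_log2_e constant defined further down, split into lead/tail pieces on the near-one path). Reduced to its essence, that rescaling is the identity log2(x) = ln(x) * log2(e); a one-line C sketch, with log2f_ref as an illustrative name:

    #include <math.h>

    /* Sketch of the rescaling only.  The assembly keeps log2(e) as
     * lead + tail pieces near x == 1 so the multiplication does not
     * drop the low-order bits. */
    float log2f_ref(float x)
    {
        return logf(x) * 1.44269504f;     /* log2(e), cf. .L__real_log2_e */
    }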
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(log2f)
+#define fname_special _log2f_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %eax, %eax
+ movdqa %xmm0, %xmm3
+ movss %xmm0, %xmm4
+ psrld $23, %xmm3
+ movd %xmm0, %eax
+ psubd .L__mask_127(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2ps %xmm3, %xmm5 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm1
+ andps .L__real_inf(%rip), %xmm1
+ comiss .L__real_inf(%rip), %xmm1
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorps %xmm1, %xmm1
+ comiss %xmm1, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm4
+
+ comiss .L__real_neg127(%rip), %xmm5
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ mov %eax, %r9d
+ and .L__mask_mant_all7(%rip), %eax
+ and .L__mask_mant8(%rip), %r9d
+ shl $1, %r9d
+ add %r9d, %eax
+ mov %eax, p_temp(%rsp)
+
+ # near one codepath
+ andps .L__real_notsign(%rip), %xmm4
+ comiss .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movss p_temp(%rsp), %xmm1
+ shr $16, %eax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subss %xmm2, %xmm1
+ mulss (%r9,%rax,4), %xmm1
+
+ movss %xmm1, %xmm2
+ movss %xmm1, %xmm0
+
+ # poly
+ mulss .L__real_1_over_3(%rip), %xmm2
+ mulss %xmm1, %xmm0
+ addss .L__real_1_over_2(%rip), %xmm2
+
+ lea .L__log_128_tail(%rip), %r9
+ lea .L__log_128_lead(%rip), %r10
+
+ mulss %xmm0, %xmm2
+ movss (%r9,%rax,4), %xmm3
+ addss %xmm2, %xmm1
+
+ mulss .L__real_log2_e(%rip), %xmm1
+
+ # m + log2(G) - poly*log2_e
+ subss %xmm1, %xmm3
+ movss %xmm3, %xmm0
+ addss (%r10,%rax,4), %xmm5
+ addss %xmm5, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
+ # r = x - 1.0
+ movss .L__real_two(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm0
+
+ # u = r / (2.0 + r)
+ addss %xmm0, %xmm2
+ movss %xmm0, %xmm1
+ divss %xmm2, %xmm1 # u
+
+ # correction = r * u
+ movss %xmm0, %xmm4
+ mulss %xmm1, %xmm4
+
+ # u = u + u
+ addss %xmm1, %xmm1
+ movss %xmm1, %xmm2
+ mulss %xmm2, %xmm2 # v = u^2
+
+ # r2 = (u * v * (ca_1 + v * ca_2) - correction)
+ movss %xmm1, %xmm3
+ mulss %xmm2, %xmm3 # u^3
+ mulss .L__real_ca2(%rip), %xmm2 # Bu^2
+ addss .L__real_ca1(%rip), %xmm2 # +A
+ mulss %xmm3, %xmm2
+ subss %xmm4, %xmm2 # -correction
+
+ movdqa %xmm0, %xmm5
+ pand .L__mask_lower(%rip), %xmm5
+ subss %xmm5, %xmm0
+ addss %xmm0, %xmm2
+
+ movss %xmm5, %xmm0
+ movss %xmm2, %xmm1
+
+ mulss .L__real_log2_e_tail(%rip), %xmm2
+ mulss .L__real_log2_e_tail(%rip), %xmm0
+ mulss .L__real_log2_e_lead(%rip), %xmm1
+ mulss .L__real_log2_e_lead(%rip), %xmm5
+
+ addss %xmm2, %xmm0
+ addss %xmm1, %xmm0
+ addss %xmm5, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm2
+ movdqa %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %eax
+ psrld $23, %xmm5
+ psubd .L__mask_253(%rip), %xmm5
+ cvtdq2ps %xmm5, %xmm5
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movss .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movss .L__real_nan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %eax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %eax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %eax
+ movd %eax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: .quad 0x04000000040000000 # 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+.L__real_neg_qnan: .quad 0x0ffc00000ffc00000
+ .quad 0x0ffc00000ffc00000
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+
+.L__mask_mant_all7: .quad 0x00000000007f0000
+ .quad 0x00000000007f0000
+.L__mask_mant8: .quad 0x0000000000008000
+ .quad 0x0000000000008000
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+.L__real_log2_e_lead: .quad 0x03FB800003FB80000 # 1.4375000000
+ .quad 0x03FB800003FB80000
+.L__real_log2_e_tail: .quad 0x03BAA3B293BAA3B29 # 0.0051950408889633
+ .quad 0x03BAA3B293BAA3B29
+
+.L__real_log2_e: .quad 0x3fb8aa3b3fb8aa3b
+ .quad 0x0000000000000000
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000
+ .quad 0x0ffff0000ffff0000
+
+.align 16
+
+.L__real_neg127: .long 0x0c2fe0000
+ .long 0
+ .quad 0
+
+.L__mask_253: .long 0x000000fd
+ .long 0
+ .quad 0
+
+.L__real_threshold: .long 0x3d800000
+ .long 0
+ .quad 0
+
+.L__mask_01: .long 0x00000001
+ .long 0
+ .quad 0
+
+.L__mask_80: .long 0x00000080
+ .long 0
+ .quad 0
+
+.L__real_3b800000: .long 0x3b800000
+ .long 0
+ .quad 0
+
+.L__real_1_over_3: .long 0x3eaaaaab
+ .long 0
+ .quad 0
+
+.L__real_1_over_2: .long 0x3f000000
+ .long 0
+ .quad 0
+
+.align 16
+.L__log_128_lead:
+ .long 0x00000000
+ .long 0x3c37c000
+ .long 0x3cb70000
+ .long 0x3d08c000
+ .long 0x3d35c000
+ .long 0x3d624000
+ .long 0x3d874000
+ .long 0x3d9d4000
+ .long 0x3db30000
+ .long 0x3dc8c000
+ .long 0x3dde4000
+ .long 0x3df38000
+ .long 0x3e044000
+ .long 0x3e0ec000
+ .long 0x3e194000
+ .long 0x3e238000
+ .long 0x3e2e0000
+ .long 0x3e380000
+ .long 0x3e424000
+ .long 0x3e4c4000
+ .long 0x3e564000
+ .long 0x3e604000
+ .long 0x3e6a4000
+ .long 0x3e740000
+ .long 0x3e7dc000
+ .long 0x3e83c000
+ .long 0x3e888000
+ .long 0x3e8d4000
+ .long 0x3e920000
+ .long 0x3e96c000
+ .long 0x3e9b8000
+ .long 0x3ea00000
+ .long 0x3ea4c000
+ .long 0x3ea94000
+ .long 0x3eae0000
+ .long 0x3eb28000
+ .long 0x3eb70000
+ .long 0x3ebb8000
+ .long 0x3ec00000
+ .long 0x3ec44000
+ .long 0x3ec8c000
+ .long 0x3ecd4000
+ .long 0x3ed18000
+ .long 0x3ed5c000
+ .long 0x3eda0000
+ .long 0x3ede8000
+ .long 0x3ee2c000
+ .long 0x3ee70000
+ .long 0x3eeb0000
+ .long 0x3eef4000
+ .long 0x3ef38000
+ .long 0x3ef78000
+ .long 0x3efbc000
+ .long 0x3effc000
+ .long 0x3f01c000
+ .long 0x3f040000
+ .long 0x3f060000
+ .long 0x3f080000
+ .long 0x3f0a0000
+ .long 0x3f0c0000
+ .long 0x3f0dc000
+ .long 0x3f0fc000
+ .long 0x3f11c000
+ .long 0x3f13c000
+ .long 0x3f15c000
+ .long 0x3f178000
+ .long 0x3f198000
+ .long 0x3f1b4000
+ .long 0x3f1d4000
+ .long 0x3f1f0000
+ .long 0x3f210000
+ .long 0x3f22c000
+ .long 0x3f24c000
+ .long 0x3f268000
+ .long 0x3f288000
+ .long 0x3f2a4000
+ .long 0x3f2c0000
+ .long 0x3f2dc000
+ .long 0x3f2f8000
+ .long 0x3f318000
+ .long 0x3f334000
+ .long 0x3f350000
+ .long 0x3f36c000
+ .long 0x3f388000
+ .long 0x3f3a4000
+ .long 0x3f3c0000
+ .long 0x3f3dc000
+ .long 0x3f3f8000
+ .long 0x3f414000
+ .long 0x3f42c000
+ .long 0x3f448000
+ .long 0x3f464000
+ .long 0x3f480000
+ .long 0x3f498000
+ .long 0x3f4b4000
+ .long 0x3f4d0000
+ .long 0x3f4e8000
+ .long 0x3f504000
+ .long 0x3f51c000
+ .long 0x3f538000
+ .long 0x3f550000
+ .long 0x3f56c000
+ .long 0x3f584000
+ .long 0x3f5a0000
+ .long 0x3f5b8000
+ .long 0x3f5d0000
+ .long 0x3f5ec000
+ .long 0x3f604000
+ .long 0x3f61c000
+ .long 0x3f638000
+ .long 0x3f650000
+ .long 0x3f668000
+ .long 0x3f680000
+ .long 0x3f698000
+ .long 0x3f6b0000
+ .long 0x3f6cc000
+ .long 0x3f6e4000
+ .long 0x3f6fc000
+ .long 0x3f714000
+ .long 0x3f72c000
+ .long 0x3f744000
+ .long 0x3f75c000
+ .long 0x3f770000
+ .long 0x3f788000
+ .long 0x3f7a0000
+ .long 0x3f7b8000
+ .long 0x3f7d0000
+ .long 0x3f7e8000
+ .long 0x3f800000
+
+.align 16
+.L__log_128_tail:
+ .long 0x00000000
+ .long 0x374a16dd
+ .long 0x37f2d0b8
+ .long 0x381a3aa2
+ .long 0x37b4dd63
+ .long 0x383f5721
+ .long 0x384e27e8
+ .long 0x380bf749
+ .long 0x387dbeb2
+ .long 0x37216e46
+ .long 0x3684815b
+ .long 0x383b045f
+ .long 0x390b119b
+ .long 0x391a32ea
+ .long 0x38ba789e
+ .long 0x39553f30
+ .long 0x3651cfde
+ .long 0x39685a9d
+ .long 0x39057a05
+ .long 0x395ba0ef
+ .long 0x396bc5b6
+ .long 0x3936d9bb
+ .long 0x38772619
+ .long 0x39017ce9
+ .long 0x3902d720
+ .long 0x38856dd8
+ .long 0x3941f6b4
+ .long 0x3980b652
+ .long 0x3980f561
+ .long 0x39443f13
+ .long 0x38926752
+ .long 0x39c8c763
+ .long 0x391e12f3
+ .long 0x39b7bf89
+ .long 0x36d1cfde
+ .long 0x38c7f233
+ .long 0x39087367
+ .long 0x38e95d3f
+ .long 0x38256316
+ .long 0x39d38e5c
+ .long 0x396ea247
+ .long 0x350e4788
+ .long 0x395d829f
+ .long 0x39c30f2f
+ .long 0x39fd7ee7
+ .long 0x3872e9e7
+ .long 0x3897d694
+ .long 0x3824923a
+ .long 0x39ea7c06
+ .long 0x39a7fa88
+ .long 0x391aa879
+ .long 0x39dace65
+ .long 0x39215a32
+ .long 0x39af3350
+ .long 0x3a7b5172
+ .long 0x389cf27f
+ .long 0x3902806b
+ .long 0x3909d8a9
+ .long 0x38c9faa1
+ .long 0x37a33dca
+ .long 0x3a6623d2
+ .long 0x3a3c7a61
+ .long 0x3a083a84
+ .long 0x39930161
+ .long 0x35d1cfde
+ .long 0x3a2d0ebd
+ .long 0x399f1aad
+ .long 0x3a67ff6d
+ .long 0x39ecfea8
+ .long 0x3a7b26f3
+ .long 0x39ec1fa6
+ .long 0x3a675314
+ .long 0x399e12f3
+ .long 0x3a2d4b66
+ .long 0x370c3845
+ .long 0x399ba329
+ .long 0x3a1044d3
+ .long 0x3a49a196
+ .long 0x3a79fe83
+ .long 0x3905c7aa
+ .long 0x39802391
+ .long 0x39abe796
+ .long 0x39c65a9d
+ .long 0x39cfa6c5
+ .long 0x39c7f593
+ .long 0x39af6ff7
+ .long 0x39863e4d
+ .long 0x391910c1
+ .long 0x369d5be7
+ .long 0x3a541616
+ .long 0x3a1ee960
+ .long 0x39c38ed2
+ .long 0x38e61600
+ .long 0x3a4fedb4
+ .long 0x39f6b4ab
+ .long 0x38f8d3b0
+ .long 0x3a3b3faa
+ .long 0x399fb693
+ .long 0x3a5cfe71
+ .long 0x39c5740b
+ .long 0x3a611eb0
+ .long 0x39b079c4
+ .long 0x3a4824d7
+ .long 0x39439a54
+ .long 0x3a1291ea
+ .long 0x3a6d3673
+ .long 0x3981c731
+ .long 0x3a0da88f
+ .long 0x3a53945c
+ .long 0x3895ae91
+ .long 0x3996372a
+ .long 0x39f9a832
+ .long 0x3a27eda4
+ .long 0x3a4c764f
+ .long 0x3a6a7c06
+ .long 0x370321eb
+ .long 0x3899ab3f
+ .long 0x38f02086
+ .long 0x390a1707
+ .long 0x39031e44
+ .long 0x38c6b362
+ .long 0x382bf195
+ .long 0x3a768e36
+ .long 0x3a5c503b
+ .long 0x3a3c1179
+ .long 0x3a15de1d
+ .long 0x39d3845d
+ .long 0x395f263f
+ .long 0x00000000
+
+.align 16
+.L__log_F_inv:
+ .long 0x40000000
+ .long 0x3ffe03f8
+ .long 0x3ffc0fc1
+ .long 0x3ffa232d
+ .long 0x3ff83e10
+ .long 0x3ff6603e
+ .long 0x3ff4898d
+ .long 0x3ff2b9d6
+ .long 0x3ff0f0f1
+ .long 0x3fef2eb7
+ .long 0x3fed7304
+ .long 0x3febbdb3
+ .long 0x3fea0ea1
+ .long 0x3fe865ac
+ .long 0x3fe6c2b4
+ .long 0x3fe52598
+ .long 0x3fe38e39
+ .long 0x3fe1fc78
+ .long 0x3fe07038
+ .long 0x3fdee95c
+ .long 0x3fdd67c9
+ .long 0x3fdbeb62
+ .long 0x3fda740e
+ .long 0x3fd901b2
+ .long 0x3fd79436
+ .long 0x3fd62b81
+ .long 0x3fd4c77b
+ .long 0x3fd3680d
+ .long 0x3fd20d21
+ .long 0x3fd0b6a0
+ .long 0x3fcf6475
+ .long 0x3fce168a
+ .long 0x3fcccccd
+ .long 0x3fcb8728
+ .long 0x3fca4588
+ .long 0x3fc907da
+ .long 0x3fc7ce0c
+ .long 0x3fc6980c
+ .long 0x3fc565c8
+ .long 0x3fc43730
+ .long 0x3fc30c31
+ .long 0x3fc1e4bc
+ .long 0x3fc0c0c1
+ .long 0x3fbfa030
+ .long 0x3fbe82fa
+ .long 0x3fbd6910
+ .long 0x3fbc5264
+ .long 0x3fbb3ee7
+ .long 0x3fba2e8c
+ .long 0x3fb92144
+ .long 0x3fb81703
+ .long 0x3fb70fbb
+ .long 0x3fb60b61
+ .long 0x3fb509e7
+ .long 0x3fb40b41
+ .long 0x3fb30f63
+ .long 0x3fb21643
+ .long 0x3fb11fd4
+ .long 0x3fb02c0b
+ .long 0x3faf3ade
+ .long 0x3fae4c41
+ .long 0x3fad602b
+ .long 0x3fac7692
+ .long 0x3fab8f6a
+ .long 0x3faaaaab
+ .long 0x3fa9c84a
+ .long 0x3fa8e83f
+ .long 0x3fa80a81
+ .long 0x3fa72f05
+ .long 0x3fa655c4
+ .long 0x3fa57eb5
+ .long 0x3fa4a9cf
+ .long 0x3fa3d70a
+ .long 0x3fa3065e
+ .long 0x3fa237c3
+ .long 0x3fa16b31
+ .long 0x3fa0a0a1
+ .long 0x3f9fd80a
+ .long 0x3f9f1166
+ .long 0x3f9e4cad
+ .long 0x3f9d89d9
+ .long 0x3f9cc8e1
+ .long 0x3f9c09c1
+ .long 0x3f9b4c70
+ .long 0x3f9a90e8
+ .long 0x3f99d723
+ .long 0x3f991f1a
+ .long 0x3f9868c8
+ .long 0x3f97b426
+ .long 0x3f97012e
+ .long 0x3f964fda
+ .long 0x3f95a025
+ .long 0x3f94f209
+ .long 0x3f944581
+ .long 0x3f939a86
+ .long 0x3f92f114
+ .long 0x3f924925
+ .long 0x3f91a2b4
+ .long 0x3f90fdbc
+ .long 0x3f905a38
+ .long 0x3f8fb824
+ .long 0x3f8f177a
+ .long 0x3f8e7835
+ .long 0x3f8dda52
+ .long 0x3f8d3dcb
+ .long 0x3f8ca29c
+ .long 0x3f8c08c1
+ .long 0x3f8b7034
+ .long 0x3f8ad8f3
+ .long 0x3f8a42f8
+ .long 0x3f89ae41
+ .long 0x3f891ac7
+ .long 0x3f888889
+ .long 0x3f87f781
+ .long 0x3f8767ab
+ .long 0x3f86d905
+ .long 0x3f864b8a
+ .long 0x3f85bf37
+ .long 0x3f853408
+ .long 0x3f84a9fa
+ .long 0x3f842108
+ .long 0x3f839930
+ .long 0x3f83126f
+ .long 0x3f828cc0
+ .long 0x3f820821
+ .long 0x3f81848e
+ .long 0x3f810204
+ .long 0x3f808081
+ .long 0x3f800000
+
+
diff --git a/src/gas/logf.S b/src/gas/logf.S
new file mode 100644
index 0000000..4cee0b0
--- /dev/null
+++ b/src/gas/logf.S
@@ -0,0 +1,725 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# logf.S
+#
+# An implementation of the logf libm function.
+#
+# Prototype:
+#
+# float logf(float x);
+#
+
+#
+# Algorithm:
+# Similar to the one presented in log.S
+#
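One piece worth spelling out is the near-one path, taken when |x - 1| is below the 0.0625 threshold: it avoids cancellation by expanding ln(1+r) in u = 2r/(2+r), using the ca1/ca2 coefficients defined in the data section (approximately 1/12 and 1/80). A C sketch that mirrors the register-level steps of the .L__near_one block, with logf_near_one as an illustrative name:

    /* ln(1+r) = u + u^3/12 + u^5/80 + ...  with u = 2r/(2+r);
     * "correction" folds r into u without cancellation, exactly as the
     * .L__near_one block below does in xmm registers. */
    static float logf_near_one(float r)          /* r = x - 1.0f */
    {
        float u_half     = r / (2.0f + r);
        float correction = r * u_half;
        float u          = u_half + u_half;
        float v          = u * u;
        float r2         = u * v * (0.08333333f + 0.0125f * v) - correction;
        return r + r2;                           /* ~= logf(1 + r) */
    }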
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(logf)
+#define fname_special _logf_special@PLT
+
+
+# local variable storage offsets
+.equ p_temp, 0x0
+.equ stack_size, 0x18
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ # compute exponent part
+ xor %eax, %eax
+ movdqa %xmm0, %xmm3
+ movss %xmm0, %xmm4
+ psrld $23, %xmm3
+ movd %xmm0, %eax
+ psubd .L__mask_127(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2ps %xmm3, %xmm5 # xexp
+
+ # NaN or inf
+ movdqa %xmm0, %xmm1
+ andps .L__real_inf(%rip), %xmm1
+ comiss .L__real_inf(%rip), %xmm1
+ je .L__x_is_inf_or_nan
+
+ # check for negative numbers or zero
+ xorps %xmm1, %xmm1
+ comiss %xmm1, %xmm0
+ jbe .L__x_is_zero_or_neg
+
+ pand .L__real_mant(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm4
+
+ comiss .L__real_neg127(%rip), %xmm5
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute the index into the log tables
+ mov %eax, %r9d
+ and .L__mask_mant_all7(%rip), %eax
+ and .L__mask_mant8(%rip), %r9d
+ shl $1, %r9d
+ add %r9d, %eax
+ mov %eax, p_temp(%rsp)
+
+ # check e as a special case
+ comiss .L__real_ef(%rip), %xmm0
+ je .L__logf_e
+
+ # near one codepath
+ andps .L__real_notsign(%rip), %xmm4
+ comiss .L__real_threshold(%rip), %xmm4
+ jb .L__near_one
+
+ # F, Y
+ movss p_temp(%rsp), %xmm1
+ shr $16, %eax
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subss %xmm2, %xmm1
+ mulss (%r9,%rax,4), %xmm1
+
+ movss %xmm1, %xmm2
+ movss %xmm1, %xmm0
+
+ # poly
+ mulss .L__real_1_over_3(%rip), %xmm2
+ mulss %xmm1, %xmm0
+ addss .L__real_1_over_2(%rip), %xmm2
+ movss .L__real_log2_tail(%rip), %xmm3
+
+ lea .L__log_128_tail(%rip), %r9
+ lea .L__log_128_lead(%rip), %r10
+
+ mulss %xmm0, %xmm2
+ mulss %xmm5, %xmm3
+ addss %xmm2, %xmm1
+
+ # m*log(2) + log(G) - poly
+ movss .L__real_log2_lead(%rip), %xmm0
+ subss %xmm1, %xmm3 # z2
+ mulss %xmm5, %xmm0
+ addss (%r9,%rax,4), %xmm3 # z2
+ addss (%r10,%rax,4), %xmm0 # z1
+
+ addss %xmm3, %xmm0
+
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__logf_e:
+ movss .L__real_one(%rip), %xmm0
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__near_one:
+ # r = x - 1.0
+ movss .L__real_two(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm0
+
+ # u = r / (2.0 + r)
+ addss %xmm0, %xmm2
+ movss %xmm0, %xmm1
+ divss %xmm2, %xmm1 # u
+
+ # correction = r * u
+ movss %xmm0, %xmm4
+ mulss %xmm1, %xmm4
+
+ # u = u + u
+ addss %xmm1, %xmm1
+ movss %xmm1, %xmm2
+ mulss %xmm2, %xmm2 # v = u^2
+
+ # r2 = (u * v * (ca_1 + v * ca_2) - correction)
+ movss %xmm1, %xmm3
+ mulss %xmm2, %xmm3 # u^3
+ mulss .L__real_ca2(%rip), %xmm2 # Bu^2
+ addss .L__real_ca1(%rip), %xmm2 # +A
+ mulss %xmm3, %xmm2
+ subss %xmm4, %xmm2 # -correction
+
+ # r + r2
+ addss %xmm2, %xmm0
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subss .L__real_one(%rip), %xmm2
+ movdqa %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %eax
+ psrld $23, %xmm5
+ psubd .L__mask_253(%rip), %xmm5
+ cvtdq2ps %xmm5, %xmm5
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_zero_or_neg:
+ jne .L__x_is_neg
+
+ movss .L__real_ninf(%rip), %xmm1
+ mov .L__flag_x_zero(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ movss .L__real_nan(%rip), %xmm1
+ mov .L__flag_x_neg(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+
+ cmp .L__real_inf(%rip), %eax
+ je .L__finish
+
+ cmp .L__real_ninf(%rip), %eax
+ je .L__x_is_neg
+
+ mov .L__real_qnanbit(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__finish
+
+ or .L__real_qnanbit(%rip), %eax
+ movd %eax, %xmm1
+ mov .L__flag_x_nan(%rip), %edi
+ call fname_special
+ jmp .L__finish
+
+.p2align 4,,15
+.L__finish:
+ add $stack_size, %rsp
+ ret
+
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_zero: .long 00000001
+.L__flag_x_neg: .long 00000002
+.L__flag_x_nan: .long 00000003
+
+.align 16
+
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: .quad 0x04000000040000000 # 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+.L__real_neg_qnan: .quad 0x0ffc00000ffc00000
+ .quad 0x0ffc00000ffc00000
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant: .quad 0x0007FFFFF007FFFFF # mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+
+.L__mask_mant_all7: .quad 0x00000000007f0000
+ .quad 0x00000000007f0000
+.L__mask_mant8: .quad 0x0000000000008000
+ .quad 0x0000000000008000
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+
+.align 16
+
+.L__real_neg127: .long 0x0c2fe0000
+ .long 0
+ .quad 0
+
+.L__mask_253: .long 0x000000fd
+ .long 0
+ .quad 0
+
+.L__real_threshold: .long 0x3d800000
+ .long 0
+ .quad 0
+
+.L__mask_01: .long 0x00000001
+ .long 0
+ .quad 0
+
+.L__mask_80: .long 0x00000080
+ .long 0
+ .quad 0
+
+.L__real_3b800000: .long 0x3b800000
+ .long 0
+ .quad 0
+
+.L__real_1_over_3: .long 0x3eaaaaab
+ .long 0
+ .quad 0
+
+.L__real_1_over_2: .long 0x3f000000
+ .long 0
+ .quad 0
+
+.align 16
+.L__log_128_lead:
+ .long 0x00000000
+ .long 0x3bff0000
+ .long 0x3c7e0000
+ .long 0x3cbdc000
+ .long 0x3cfc1000
+ .long 0x3d1cf000
+ .long 0x3d3ba000
+ .long 0x3d5a1000
+ .long 0x3d785000
+ .long 0x3d8b2000
+ .long 0x3d9a0000
+ .long 0x3da8d000
+ .long 0x3db78000
+ .long 0x3dc61000
+ .long 0x3dd49000
+ .long 0x3de2f000
+ .long 0x3df13000
+ .long 0x3dff6000
+ .long 0x3e06b000
+ .long 0x3e0db000
+ .long 0x3e14a000
+ .long 0x3e1b8000
+ .long 0x3e226000
+ .long 0x3e293000
+ .long 0x3e2ff000
+ .long 0x3e36b000
+ .long 0x3e3d5000
+ .long 0x3e43f000
+ .long 0x3e4a9000
+ .long 0x3e511000
+ .long 0x3e579000
+ .long 0x3e5e1000
+ .long 0x3e647000
+ .long 0x3e6ae000
+ .long 0x3e713000
+ .long 0x3e778000
+ .long 0x3e7dc000
+ .long 0x3e820000
+ .long 0x3e851000
+ .long 0x3e882000
+ .long 0x3e8b3000
+ .long 0x3e8e4000
+ .long 0x3e914000
+ .long 0x3e944000
+ .long 0x3e974000
+ .long 0x3e9a3000
+ .long 0x3e9d3000
+ .long 0x3ea02000
+ .long 0x3ea30000
+ .long 0x3ea5f000
+ .long 0x3ea8d000
+ .long 0x3eabb000
+ .long 0x3eae8000
+ .long 0x3eb16000
+ .long 0x3eb43000
+ .long 0x3eb70000
+ .long 0x3eb9c000
+ .long 0x3ebc9000
+ .long 0x3ebf5000
+ .long 0x3ec21000
+ .long 0x3ec4d000
+ .long 0x3ec78000
+ .long 0x3eca3000
+ .long 0x3ecce000
+ .long 0x3ecf9000
+ .long 0x3ed24000
+ .long 0x3ed4e000
+ .long 0x3ed78000
+ .long 0x3eda2000
+ .long 0x3edcc000
+ .long 0x3edf5000
+ .long 0x3ee1e000
+ .long 0x3ee47000
+ .long 0x3ee70000
+ .long 0x3ee99000
+ .long 0x3eec1000
+ .long 0x3eeea000
+ .long 0x3ef12000
+ .long 0x3ef3a000
+ .long 0x3ef61000
+ .long 0x3ef89000
+ .long 0x3efb0000
+ .long 0x3efd7000
+ .long 0x3effe000
+ .long 0x3f012000
+ .long 0x3f025000
+ .long 0x3f039000
+ .long 0x3f04c000
+ .long 0x3f05f000
+ .long 0x3f072000
+ .long 0x3f084000
+ .long 0x3f097000
+ .long 0x3f0aa000
+ .long 0x3f0bc000
+ .long 0x3f0cf000
+ .long 0x3f0e1000
+ .long 0x3f0f4000
+ .long 0x3f106000
+ .long 0x3f118000
+ .long 0x3f12a000
+ .long 0x3f13c000
+ .long 0x3f14e000
+ .long 0x3f160000
+ .long 0x3f172000
+ .long 0x3f183000
+ .long 0x3f195000
+ .long 0x3f1a7000
+ .long 0x3f1b8000
+ .long 0x3f1c9000
+ .long 0x3f1db000
+ .long 0x3f1ec000
+ .long 0x3f1fd000
+ .long 0x3f20e000
+ .long 0x3f21f000
+ .long 0x3f230000
+ .long 0x3f241000
+ .long 0x3f252000
+ .long 0x3f263000
+ .long 0x3f273000
+ .long 0x3f284000
+ .long 0x3f295000
+ .long 0x3f2a5000
+ .long 0x3f2b5000
+ .long 0x3f2c6000
+ .long 0x3f2d6000
+ .long 0x3f2e6000
+ .long 0x3f2f7000
+ .long 0x3f307000
+ .long 0x3f317000
+
+.align 16
+.L__log_128_tail:
+ .long 0x00000000
+ .long 0x3429ac41
+ .long 0x35a8b0fc
+ .long 0x368d83ea
+ .long 0x361b0e78
+ .long 0x3687b9fe
+ .long 0x3631ec65
+ .long 0x36dd7119
+ .long 0x35c30045
+ .long 0x379b7751
+ .long 0x37ebcb0d
+ .long 0x37839f83
+ .long 0x37528ae5
+ .long 0x37a2eb18
+ .long 0x36da7495
+ .long 0x36a91eb7
+ .long 0x3783b715
+ .long 0x371131db
+ .long 0x383f3e68
+ .long 0x38156a97
+ .long 0x38297c0f
+ .long 0x387e100f
+ .long 0x3815b665
+ .long 0x37e5e3a1
+ .long 0x38183853
+ .long 0x35fe719d
+ .long 0x38448108
+ .long 0x38503290
+ .long 0x373539e8
+ .long 0x385e0ff1
+ .long 0x3864a740
+ .long 0x3786742d
+ .long 0x387be3cd
+ .long 0x3685ad3e
+ .long 0x3803b715
+ .long 0x37adcbdc
+ .long 0x380c36af
+ .long 0x371652d3
+ .long 0x38927139
+ .long 0x38c5fcd7
+ .long 0x38ae55d5
+ .long 0x3818c169
+ .long 0x38a0fde7
+ .long 0x38ad09ef
+ .long 0x3862bae1
+ .long 0x38eecd4c
+ .long 0x3798aad2
+ .long 0x37421a1a
+ .long 0x38c5e10e
+ .long 0x37bf2aee
+ .long 0x382d872d
+ .long 0x37ee2e8a
+ .long 0x38dedfac
+ .long 0x3802f2b9
+ .long 0x38481e9b
+ .long 0x380eaa2b
+ .long 0x38ebfb5d
+ .long 0x38255fdd
+ .long 0x38783b82
+ .long 0x3851da1e
+ .long 0x374e1b05
+ .long 0x388f439b
+ .long 0x38ca0e10
+ .long 0x38cac08b
+ .long 0x3891f65f
+ .long 0x378121cb
+ .long 0x386c9a9a
+ .long 0x38949923
+ .long 0x38777bcc
+ .long 0x37b12d26
+ .long 0x38a6ced3
+ .long 0x38ebd3e6
+ .long 0x38fbe3cd
+ .long 0x38d785c2
+ .long 0x387e7e00
+ .long 0x38f392c5
+ .long 0x37d40983
+ .long 0x38081a7c
+ .long 0x3784c3ad
+ .long 0x38cce923
+ .long 0x380f5faf
+ .long 0x3891fd38
+ .long 0x38ac47bc
+ .long 0x3897042b
+ .long 0x392952d2
+ .long 0x396fced4
+ .long 0x37f97073
+ .long 0x385e9eae
+ .long 0x3865c84a
+ .long 0x38130ba3
+ .long 0x3979cf16
+ .long 0x3938cac9
+ .long 0x38c3d2f4
+ .long 0x39755dec
+ .long 0x38e6b467
+ .long 0x395c0fb8
+ .long 0x383ebce0
+ .long 0x38dcd192
+ .long 0x39186bdf
+ .long 0x392de74c
+ .long 0x392f0944
+ .long 0x391bff61
+ .long 0x38e9ed44
+ .long 0x38686dc8
+ .long 0x396b99a7
+ .long 0x39099c89
+ .long 0x37a27673
+ .long 0x390bdaa3
+ .long 0x397069ab
+ .long 0x388449ff
+ .long 0x39013538
+ .long 0x392dc268
+ .long 0x3947f423
+ .long 0x394ff17c
+ .long 0x3945e10e
+ .long 0x3929e8f5
+ .long 0x38f85db0
+ .long 0x38735f99
+ .long 0x396c08db
+ .long 0x3909e600
+ .long 0x37b4996f
+ .long 0x391233cc
+ .long 0x397cead9
+ .long 0x38adb5cd
+ .long 0x3920261a
+ .long 0x3958ee36
+ .long 0x35aa4905
+ .long 0x37cbd11e
+ .long 0x3805fdf4
+
+.align 16
+.L__log_F_inv:
+ .long 0x40000000
+ .long 0x3ffe03f8
+ .long 0x3ffc0fc1
+ .long 0x3ffa232d
+ .long 0x3ff83e10
+ .long 0x3ff6603e
+ .long 0x3ff4898d
+ .long 0x3ff2b9d6
+ .long 0x3ff0f0f1
+ .long 0x3fef2eb7
+ .long 0x3fed7304
+ .long 0x3febbdb3
+ .long 0x3fea0ea1
+ .long 0x3fe865ac
+ .long 0x3fe6c2b4
+ .long 0x3fe52598
+ .long 0x3fe38e39
+ .long 0x3fe1fc78
+ .long 0x3fe07038
+ .long 0x3fdee95c
+ .long 0x3fdd67c9
+ .long 0x3fdbeb62
+ .long 0x3fda740e
+ .long 0x3fd901b2
+ .long 0x3fd79436
+ .long 0x3fd62b81
+ .long 0x3fd4c77b
+ .long 0x3fd3680d
+ .long 0x3fd20d21
+ .long 0x3fd0b6a0
+ .long 0x3fcf6475
+ .long 0x3fce168a
+ .long 0x3fcccccd
+ .long 0x3fcb8728
+ .long 0x3fca4588
+ .long 0x3fc907da
+ .long 0x3fc7ce0c
+ .long 0x3fc6980c
+ .long 0x3fc565c8
+ .long 0x3fc43730
+ .long 0x3fc30c31
+ .long 0x3fc1e4bc
+ .long 0x3fc0c0c1
+ .long 0x3fbfa030
+ .long 0x3fbe82fa
+ .long 0x3fbd6910
+ .long 0x3fbc5264
+ .long 0x3fbb3ee7
+ .long 0x3fba2e8c
+ .long 0x3fb92144
+ .long 0x3fb81703
+ .long 0x3fb70fbb
+ .long 0x3fb60b61
+ .long 0x3fb509e7
+ .long 0x3fb40b41
+ .long 0x3fb30f63
+ .long 0x3fb21643
+ .long 0x3fb11fd4
+ .long 0x3fb02c0b
+ .long 0x3faf3ade
+ .long 0x3fae4c41
+ .long 0x3fad602b
+ .long 0x3fac7692
+ .long 0x3fab8f6a
+ .long 0x3faaaaab
+ .long 0x3fa9c84a
+ .long 0x3fa8e83f
+ .long 0x3fa80a81
+ .long 0x3fa72f05
+ .long 0x3fa655c4
+ .long 0x3fa57eb5
+ .long 0x3fa4a9cf
+ .long 0x3fa3d70a
+ .long 0x3fa3065e
+ .long 0x3fa237c3
+ .long 0x3fa16b31
+ .long 0x3fa0a0a1
+ .long 0x3f9fd80a
+ .long 0x3f9f1166
+ .long 0x3f9e4cad
+ .long 0x3f9d89d9
+ .long 0x3f9cc8e1
+ .long 0x3f9c09c1
+ .long 0x3f9b4c70
+ .long 0x3f9a90e8
+ .long 0x3f99d723
+ .long 0x3f991f1a
+ .long 0x3f9868c8
+ .long 0x3f97b426
+ .long 0x3f97012e
+ .long 0x3f964fda
+ .long 0x3f95a025
+ .long 0x3f94f209
+ .long 0x3f944581
+ .long 0x3f939a86
+ .long 0x3f92f114
+ .long 0x3f924925
+ .long 0x3f91a2b4
+ .long 0x3f90fdbc
+ .long 0x3f905a38
+ .long 0x3f8fb824
+ .long 0x3f8f177a
+ .long 0x3f8e7835
+ .long 0x3f8dda52
+ .long 0x3f8d3dcb
+ .long 0x3f8ca29c
+ .long 0x3f8c08c1
+ .long 0x3f8b7034
+ .long 0x3f8ad8f3
+ .long 0x3f8a42f8
+ .long 0x3f89ae41
+ .long 0x3f891ac7
+ .long 0x3f888889
+ .long 0x3f87f781
+ .long 0x3f8767ab
+ .long 0x3f86d905
+ .long 0x3f864b8a
+ .long 0x3f85bf37
+ .long 0x3f853408
+ .long 0x3f84a9fa
+ .long 0x3f842108
+ .long 0x3f839930
+ .long 0x3f83126f
+ .long 0x3f828cc0
+ .long 0x3f820821
+ .long 0x3f81848e
+ .long 0x3f810204
+ .long 0x3f808081
+ .long 0x3f800000
+
+
diff --git a/src/gas/nearbyint.S b/src/gas/nearbyint.S
new file mode 100644
index 0000000..edb1549
--- /dev/null
+++ b/src/gas/nearbyint.S
@@ -0,0 +1,98 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# nearbyint.S
+#
+# An implementation of the nearbyint libm function.
+#
+# Prototype:
+#
+# double nearbyint(double x);
+#
+
+#
+# Algorithm:
+# Round x to an integral value by adding and then subtracting a
+# sign-matched 2^52, letting the FPU rounding do the work; inputs
+# with |x| >= 2^52 are already integral and are returned unchanged.
+#
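The same trick is easy to express in C. The sketch below mirrors the sign handling spelled out in the comments further down, and assumes strict IEEE evaluation (no -ffast-math) and the default round-to-nearest mode; nearbyint_sketch is an illustrative name, not the entry point this file defines:

    #include <math.h>

    /* Sketch only: adding and subtracting a sign-matched 2^52 makes the
     * FPU round x to an integer; larger magnitudes (and NaNs) pass through. */
    static double nearbyint_sketch(double x)
    {
        if (!(fabs(x) < 0x1p52))                 /* |x| >= 2^52, inf, or NaN */
            return x;
        double big = copysign(0x1p52, x);
        return copysign((x + big) - big, x);     /* keep the sign of -0.0 */
    }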
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(nearbyint)
+#define fname_special _nearbyint_special
+
+
+# local variable storage offsets
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ movsd .L__2p52_mask_64(%rip),%xmm2
+ movsd .L__sign_mask_64(%rip),%xmm4
+ movsd %xmm4,%xmm6
+ movsd %xmm0,%xmm1 # copy the input into registers xmm1 and xmm5
+ movsd %xmm0,%xmm5
+ pand %xmm4,%xmm1 # xmm1 = abs(xmm1)
+ movsd %xmm1,%xmm3 # move xmm1 to xmm3
+ comisd %xmm2,%xmm1 #
+ jnc .L__greater_than_2p52 #
+ jp .L__is_infinity_nan # the parity flag is set if either xmm2 or
+ # xmm1 is NaN
+.L__normal_input_case:
+ #sign.u32 = checkbits.u32[1] & 0x80000000;
+ #xmm4 = sign.u32
+ pandn %xmm5,%xmm4
+ #val_2p52.u32[1] = sign.u32 | 0x43300000;
+ #val_2p52.u32[0] = 0;
+ por %xmm4,%xmm2
+ #val_2p52.f64 = (x + val_2p52.f64) - val_2p52.f64;
+ addpd %xmm2,%xmm5
+ subpd %xmm5,%xmm2
+ #val_2p52.u32[1] = ((val_2p52.u32[1] << 1) >> 1) | sign.u32;
+ pand %xmm6,%xmm2
+ por %xmm4,%xmm2
+ movsd %xmm2,%xmm0 # move the result to xmm0 register
+ ret
+.L__special_case:
+.L__greater_than_2p52:
+ ret # result is present in xmm0
+.L__is_infinity_nan:
+ addpd %xmm0,%xmm0
+ ret
+.align 16
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0
+
+
+
+
+
+
diff --git a/src/gas/pow.S b/src/gas/pow.S
new file mode 100644
index 0000000..8028b83
--- /dev/null
+++ b/src/gas/pow.S
@@ -0,0 +1,2244 @@
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+#ifdef __x86_64__
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# pow.S
+#
+# An implementation of the pow libm function.
+#
+# Prototype:
+#
+# double pow(double x, double y);
+#
+
+#
+# Algorithm:
+# x^y = e^(y*ln(x))
+#
+# Look in exp, log for the respective algorithms
+#
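+# Evaluating exp(y*log(x)) naively in double precision magnifies the
+# rounding error of log(x) by roughly |y*ln(x)|, so ln(x) and the product
+# y*ln(x) are carried below as head/tail (extra-precision) pairs before the
+# final exp stage.  Very roughly, in C (a sketch only; the special-case
+# handling and the head/tail bookkeeping that follow are omitted):
+#
+#     double pow_sketch(double x, double y)
+#     {
+#         double v = y * log(x);   /* done as a head/tail product below  */
+#         return exp(v);           /* table-driven exp, see the exp data */
+#     }
+#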
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(pow)
+#define fname_special _pow_special@PLT
+
+
+# local variable storage offsets
+.equ save_x, 0x0
+.equ save_y, 0x10
+.equ p_temp_exp, 0x20
+.equ negate_result, 0x30
+.equ save_ax, 0x40
+.equ y_head, 0x50
+.equ p_temp_log, 0x60
+.equ stack_size, 0x78
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ movsd %xmm0, save_x(%rsp)
+ movsd %xmm1, save_y(%rsp)
+
+ mov save_x(%rsp), %rdx
+ mov save_y(%rsp), %r8
+
+ mov .L__exp_mant_mask(%rip), %r10
+ and %r8, %r10
+ jz .L__y_is_zero
+
+ cmp .L__pos_one(%rip), %r8
+ je .L__y_is_one
+
+ mov .L__sign_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__sign_mask(%rip), %r9
+ mov .L__pos_zero(%rip), %rax
+ mov %rax, negate_result(%rsp)
+ je .L__x_is_neg
+
+ cmp .L__pos_one(%rip), %rdx
+ je .L__x_is_pos_one
+
+ cmp .L__pos_zero(%rip), %rdx
+ je .L__x_is_zero
+
+ mov .L__exp_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ je .L__x_is_inf_or_nan
+
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__ay_max_bound(%rip), %r10
+ jg .L__ay_is_very_large
+
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__ay_min_bound(%rip), %r10
+ jl .L__ay_is_very_small
+
+ # -----------------------------
+ # compute log(x) here
+ # -----------------------------
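+ #
+ # Table-driven reduction: write x = 2^xexp * m with m in [1,2) and pick
+ # F = 1 + j/256 from the leading mantissa bits.  Then, roughly,
+ #
+ #     ln(x) = xexp*ln(2) + ln(F) + ln(m/F)
+ #
+ # with ln(F) from the 256-entry lead/tail tables, 1/F from the reciprocal
+ # tables, and ln(m/F) from a short polynomial in the small quantity
+ # (m - F)/F.  All pieces are kept as head/tail pairs so the result has
+ # extra precision for the later multiply by y.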
+.L__log_x:
+
+ # compute exponent part
+ xor %r8, %r8
+ movdqa %xmm0, %xmm3
+ psrlq $52, %xmm3
+ movd %xmm0, %r8
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+ pand .L__real_mant(%rip), %xmm2
+
+ comisd .L__mask_1023_f(%rip), %xmm6
+ je .L__denormal_adjust
+
+.L__continue_common:
+
+ # compute index into the log tables
+ movsd %xmm0, %xmm7
+ mov %r8, %r9
+ and .L__mask_mant_all8(%rip), %r8
+ and .L__mask_mant9(%rip), %r9
+ subsd .L__real_one(%rip), %xmm7
+ shl %r9
+ add %r9, %r8
+ mov %r8, p_temp_log(%rsp)
+ andpd .L__real_notsign(%rip), %xmm7
+
+ # form F and Y; branch to the near-one codepath when |x - 1| < 0.125
+ movsd p_temp_log(%rsp), %xmm1
+ shr $44, %r8
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ comisd .L__real_threshold(%rip), %xmm7
+ lea .L__log_F_inv_head(%rip), %r9
+ lea .L__log_F_inv_tail(%rip), %rdx
+ jb .L__near_one
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ movsd %xmm1, %xmm4
+ mulsd (%r9,%r8,8), %xmm1
+ movsd %xmm1, %xmm5
+ mulsd (%rdx,%r8,8), %xmm4
+ movsd %xmm4, %xmm7
+ addsd %xmm4, %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_6(%rip), %xmm3
+ movsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ subsd %xmm2, %xmm5
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_5(%rip), %xmm3
+ addsd .L__real_1_over_2(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_over_4(%rip), %xmm3
+ addsd %xmm5, %xmm7
+ mulsd %xmm4, %xmm3
+ addsd %xmm3, %xmm1
+ addsd %xmm7, %xmm1
+
+ movsd .L__real_log2_tail(%rip), %xmm5
+ lea .L__log_256_tail(%rip), %rdx
+ mulsd %xmm6, %xmm5
+ movsd (%r9,%r8,8), %xmm0
+ subsd %xmm1, %xmm5
+
+ movsd (%rdx,%r8,8), %xmm3
+ addsd %xmm5, %xmm3
+ movsd %xmm3, %xmm1
+ subsd %xmm2, %xmm3
+
+ movsd .L__real_log2_lead(%rip), %xmm7
+ mulsd %xmm6, %xmm7
+ addsd %xmm7, %xmm0
+
+ # result of ln(x) is computed from head and tail parts, resH and resT
+ # res = ln(x) = resH + resT
+ # resH and resT are in full precision
+
+ # resT is computed from head and tail parts, resT_h and resT_t
+ # resT = resT_h + resT_t
+
+ # now
+ # xmm3 - resT
+ # xmm0 - resH
+ # xmm1 - (resT_t)
+ # xmm2 - (-resT_h)
+
+.L__log_x_continue:
+
+ movsd %xmm0, %xmm7
+ addsd %xmm3, %xmm0
+ movsd %xmm0, %xmm5
+ andpd .L__real_fffffffff8000000(%rip), %xmm0
+
+ # xmm0 - H
+ # xmm7 - resH
+ # xmm5 - res
+
+ mov save_y(%rsp), %rax
+ and .L__real_fffffffff8000000(%rip), %rax
+
+ addsd %xmm3, %xmm2
+ subsd %xmm5, %xmm7
+ subsd %xmm2, %xmm1
+ addsd %xmm3, %xmm7
+ subsd %xmm0, %xmm5
+
+ mov %rax, y_head(%rsp)
+ movsd save_y(%rsp), %xmm4
+
+ addsd %xmm1, %xmm7
+ addsd %xmm5, %xmm7
+
+ # res = H + T
+ # H has leading 26 bits of precision
+ # T has full precision
+
+ # xmm0 - H
+ # xmm7 - T
+
+ movsd y_head(%rsp), %xmm2
+ subsd %xmm2, %xmm4
+
+ # y is split into head and tail
+ # for y * ln(x) computation
+
+ # xmm4 - Yt
+ # xmm2 - Yh
+ # xmm0 - H
+ # xmm7 - T
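+ #
+ # (splitting each factor into a 26-bit head and a tail makes the
+ # head*head product exact in double precision, so y*ln(x) below is
+ # obtained to roughly double-double accuracy)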
+
+ movsd %xmm4, %xmm3
+ movsd %xmm7, %xmm5
+ movsd %xmm0, %xmm6
+ mulsd %xmm7, %xmm3 # YtRt
+ mulsd %xmm0, %xmm4 # YtRh
+ mulsd %xmm2, %xmm5 # YhRt
+ mulsd %xmm2, %xmm6 # YhRh
+
+ movsd %xmm6, %xmm1
+ addsd %xmm4, %xmm3
+ addsd %xmm5, %xmm3
+
+ addsd %xmm3, %xmm1
+ movsd %xmm1, %xmm0
+
+ subsd %xmm1, %xmm6
+ addsd %xmm3, %xmm6
+
+ # y * ln(x) = v + vt
+ # v and vt are in full precision
+
+ # xmm0 - v
+ # xmm6 - vt
+
+ # -----------------------------
+ # compute exp( y * ln(x) ) here
+ # -----------------------------
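+ #
+ # With n = int(v * 64/ln(2)) and n = 64*m + j (0 <= j < 64),
+ #
+ #     e^v = 2^m * 2^(j/64) * e^r,   r = v - n*ln(2)/64
+ #
+ # 2^(j/64) comes from the head/tail tables below, e^r from a short
+ # polynomial in the small remainder r, and the 2^m scaling is applied by
+ # constructing the exponent directly (with extra care near underflow).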
+
+ # v * (64/ln(2))
+ movsd .L__real_64_by_log2(%rip), %xmm7
+ movsd %xmm0, p_temp_exp(%rsp)
+ mulsd %xmm0, %xmm7
+ mov p_temp_exp(%rsp), %rdx
+
+ # v < 1024*ln(2), ( v * (64/ln(2)) ) < 64*1024
+ # v >= -1075*ln(2), ( v * (64/ln(2)) ) >= 64*(-1075)
+ comisd .L__real_p65536(%rip), %xmm7
+ ja .L__process_result_inf
+
+ comisd .L__real_m68800(%rip), %xmm7
+ jb .L__process_result_zero
+
+ # n = int( v * (64/ln(2)) )
+ cvtpd2dq %xmm7, %xmm4
+ lea .L__two_to_jby64_head_table(%rip), %r10
+ lea .L__two_to_jby64_tail_table(%rip), %r11
+ cvtdq2pd %xmm4, %xmm1
+
+ # r1 = x - n * ln(2)/64 head
+ movsd .L__real_log2_by_64_head(%rip), %xmm2
+ mulsd %xmm1, %xmm2
+ movd %xmm4, %ecx
+ mov $0x3f, %rax
+ and %ecx, %eax
+ subsd %xmm2, %xmm0
+
+ # r2 = - n * ln(2)/64 tail
+ mulsd .L__real_log2_by_64_tail(%rip), %xmm1
+ movsd %xmm0, %xmm2
+
+ # m = (n - j) / 64
+ sub %eax, %ecx
+ sar $6, %ecx
+
+ # r1+r2
+ addsd %xmm1, %xmm2
+ addsd %xmm6, %xmm2 # add vt here
+ movsd %xmm2, %xmm1
+
+ # q
+ movsd .L__real_1_by_2(%rip), %xmm0
+ movsd .L__real_1_by_24(%rip), %xmm3
+ movsd .L__real_1_by_720(%rip), %xmm4
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm4
+
+ movsd %xmm1, %xmm5
+ mulsd %xmm2, %xmm1
+ addsd .L__real_one(%rip), %xmm0
+ addsd .L__real_1_by_6(%rip), %xmm3
+ mulsd %xmm1, %xmm5
+ addsd .L__real_1_by_120(%rip), %xmm4
+ mulsd %xmm2, %xmm0
+ mulsd %xmm1, %xmm3
+
+ mulsd %xmm5, %xmm4
+
+ # deal with denormal results
+ xor %r9d, %r9d
+ cmp .L__denormal_threshold(%rip), %ecx
+
+ addsd %xmm4, %xmm3
+ addsd %xmm3, %xmm0
+
+ cmovle %ecx, %r9d
+ add $1023, %rcx
+ shl $52, %rcx
+
+ # f1, f2
+ movsd (%r11,%rax,8), %xmm5
+ movsd (%r10,%rax,8), %xmm1
+ mulsd %xmm0, %xmm5
+ mulsd %xmm0, %xmm1
+
+ cmp .L__real_inf(%rip), %rcx
+
+ # (f1+f2)*(1+q)
+ addsd (%r11,%rax,8), %xmm5
+ addsd %xmm5, %xmm1
+ addsd (%r10,%rax,8), %xmm1
+ movsd %xmm1, %xmm0
+
+ je .L__process_almost_inf
+
+ test %r9d, %r9d
+ mov %rcx, p_temp_exp(%rsp)
+ jnz .L__process_denormal
+ mulsd p_temp_exp(%rsp), %xmm0
+ orpd negate_result(%rsp), %xmm0
+
+.L__final_check:
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__process_almost_inf:
+ comisd .L__real_one(%rip), %xmm0
+ jae .L__process_result_inf
+
+ orpd .L__enable_almost_inf(%rip), %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__process_denormal:
+ mov %r9d, %ecx
+ xor %r11d, %r11d
+ comisd .L__real_one(%rip), %xmm0
+ cmovae %ecx, %r11d
+ cmp .L__denormal_threshold(%rip), %r11d
+ jne .L__process_true_denormal
+
+ mulsd p_temp_exp(%rsp), %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__process_true_denormal:
+ xor %r8, %r8
+ cmp .L__denormal_tiny_threshold(%rip), %rdx
+ mov $1, %r9
+ jg .L__process_denormal_tiny
+ add $1074, %ecx
+ cmovs %r8, %rcx
+ shl %cl, %r9
+ mov %r9, %rcx
+
+ mov %rcx, p_temp_exp(%rsp)
+ mulsd p_temp_exp(%rsp), %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__z_denormal
+
+.p2align 4,,15
+.L__process_denormal_tiny:
+ movsd .L__real_smallest_denormal(%rip), %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__z_denormal
+
+.p2align 4,,15
+.L__process_result_zero:
+ mov .L__real_zero(%rip), %r11
+ or negate_result(%rsp), %r11
+ jmp .L__z_is_zero_or_inf
+
+.p2align 4,,15
+.L__process_result_inf:
+ mov .L__real_inf(%rip), %r11
+ or negate_result(%rsp), %r11
+ jmp .L__z_is_zero_or_inf
+
+.p2align 4,,15
+.L__denormal_adjust:
+ por .L__real_one(%rip), %xmm2
+ subsd .L__real_one(%rip), %xmm2
+ movsd %xmm2, %xmm5
+ pand .L__real_mant(%rip), %xmm2
+ movd %xmm2, %r8
+ psrlq $52, %xmm5
+ psubd .L__mask_2045(%rip), %xmm5
+ cvtdq2pd %xmm5, %xmm6
+ jmp .L__continue_common
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__ay_max_bound(%rip), %r10
+ jg .L__ay_is_very_large
+
+ # determine if y is an integer
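+ # (an unbiased exponent of 53 or more forces y to be an even integer;
+ # otherwise y is an integer iff the mantissa bits below its binary point
+ # are all zero, and the lowest integer bit tells odd from even, i.e.
+ # whether the result for a negative x must be negated)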
+ mov .L__exp_mant_mask(%rip), %r10
+ and %r8, %r10
+ mov %r10, %r11
+ mov .L__exp_shift(%rip), %rcx
+ shr %cl, %r10
+ sub .L__exp_bias(%rip), %r10
+ js .L__x_is_neg_y_is_not_int
+
+ mov .L__exp_mant_mask(%rip), %rax
+ and %rdx, %rax
+ mov %rax, save_ax(%rsp)
+
+ cmp .L__yexp_53(%rip), %r10
+ mov %r10, %rcx
+ jg .L__continue_after_y_int_check
+
+ mov .L__mant_full(%rip), %r9
+ shr %cl, %r9
+ and %r11, %r9
+ jnz .L__x_is_neg_y_is_not_int
+
+ mov .L__1_before_mant(%rip), %r9
+ shr %cl, %r9
+ and %r11, %r9
+ jz .L__continue_after_y_int_check
+
+ mov .L__sign_mask(%rip), %rax
+ mov %rax, negate_result(%rsp)
+
+.L__continue_after_y_int_check:
+
+ cmp .L__neg_zero(%rip), %rdx
+ je .L__x_is_zero
+
+ cmp .L__neg_one(%rip), %rdx
+ je .L__x_is_neg_one
+
+ mov .L__exp_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ je .L__x_is_inf_or_nan
+
+ movsd save_ax(%rsp), %xmm0
+ jmp .L__log_x
+
+
+.p2align 4,,15
+.L__near_one:
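+ # |x - 1| < 0.125: ln(x) itself is small here, so the reduced argument and
+ # the leading r + r^2/2 terms are split into head and tail pieces to keep
+ # full relative accuracy.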
+
+ # f = F - Y, r = f * inv
+ movsd %xmm1, %xmm0
+ subsd %xmm2, %xmm1
+ movsd %xmm1, %xmm4
+
+ movsd (%r9,%r8,8), %xmm3
+ addsd (%rdx,%r8,8), %xmm3
+ mulsd %xmm3, %xmm4
+ andpd .L__real_fffffffff8000000(%rip), %xmm4
+ movsd %xmm4, %xmm5 # r1
+ mulsd %xmm0, %xmm4
+ subsd %xmm4, %xmm1
+ mulsd %xmm3, %xmm1
+ movsd %xmm1, %xmm7 # r2
+ addsd %xmm5, %xmm1
+
+ movsd %xmm1, %xmm2
+ movsd %xmm1, %xmm0
+
+ lea .L__log_256_lead(%rip), %r9
+
+ # poly
+ movsd .L__real_1_over_7(%rip), %xmm3
+ movsd .L__real_1_over_4(%rip), %xmm1
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ mulsd %xmm2, %xmm0
+ movsd %xmm0, %xmm4
+ addsd .L__real_1_over_6(%rip), %xmm3
+ addsd .L__real_1_over_3(%rip), %xmm1
+ mulsd %xmm0, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ addsd .L__real_1_over_5(%rip), %xmm3
+ mulsd %xmm2, %xmm3
+ mulsd %xmm0, %xmm1
+ mulsd %xmm4, %xmm3
+
+ movsd %xmm5, %xmm2
+ movsd %xmm7, %xmm0
+ mulsd %xmm0, %xmm0
+ mulsd .L__real_1_over_2(%rip), %xmm0
+ mulsd %xmm7, %xmm5
+ addsd %xmm0, %xmm5
+ addsd %xmm7, %xmm5
+
+ movsd %xmm2, %xmm0
+ movsd %xmm2, %xmm7
+ mulsd %xmm0, %xmm0
+ mulsd .L__real_1_over_2(%rip), %xmm0
+ movsd %xmm0, %xmm4
+ addsd %xmm0, %xmm2 # r1 + r1^2/2
+ subsd %xmm2, %xmm7
+ addsd %xmm4, %xmm7
+
+ addsd %xmm7, %xmm3
+ movsd .L__real_log2_tail(%rip), %xmm4
+ addsd %xmm3, %xmm1
+ mulsd %xmm6, %xmm4
+ lea .L__log_256_tail(%rip), %rdx
+ addsd %xmm5, %xmm1
+ addsd (%rdx,%r8,8), %xmm4
+ subsd %xmm1, %xmm4
+
+ movsd %xmm4, %xmm3
+ movsd %xmm4, %xmm1
+ subsd %xmm2, %xmm3
+
+ movsd (%r9,%r8,8), %xmm0
+ movsd .L__real_log2_lead(%rip), %xmm7
+ mulsd %xmm6, %xmm7
+ addsd %xmm7, %xmm0
+
+ jmp .L__log_x_continue
+
+
+.p2align 4,,15
+.L__x_is_pos_one:
+ xor %rax, %rax
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ jz .L__final_check
+
+ mov .L__qnan_set(%rip), %r10
+ and %r8, %r10
+ jnz .L__final_check
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movsd .L__pos_one(%rip), %xmm2
+ mov .L__flag_x_one_y_snan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_zero:
+ xor %rax, %rax
+ mov .L__exp_mask(%rip), %r9
+ mov .L__real_one(%rip), %r11
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ cmove %rdx, %rax
+ mov .L__mant_mask(%rip), %r9
+ and %rax, %r9
+ jnz .L__x_is_nan
+
+ movsd .L__real_one(%rip), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_one:
+ xor %rax, %rax
+ mov %rdx, %r11
+ mov .L__exp_mask(%rip), %r9
+ or .L__qnan_set(%rip), %r11
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ cmove %rdx, %rax
+ mov .L__mant_mask(%rip), %r9
+ and %rax, %r9
+ jnz .L__x_is_nan
+
+ movd %rdx, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_neg_one:
+ mov .L__pos_one(%rip), %rdx
+ or negate_result(%rsp), %rdx
+ xor %rax, %rax
+ mov %r8, %r11
+ mov .L__exp_mask(%rip), %r10
+ or .L__qnan_set(%rip), %r11
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ jnz .L__y_is_nan
+
+ movd %rdx, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_neg_y_is_not_int:
+ mov .L__exp_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ je .L__x_is_inf_or_nan
+
+ cmp .L__neg_zero(%rip), %rdx
+ je .L__x_is_zero
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movsd .L__qnan(%rip), %xmm2
+ mov .L__flag_x_neg_y_notint(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__ay_is_very_large:
+ mov .L__exp_mask(%rip), %r9
+ and %rdx, %r9
+ cmp .L__exp_mask(%rip), %r9
+ je .L__x_is_inf_or_nan
+
+ mov .L__exp_mant_mask(%rip), %r9
+ and %rdx, %r9
+ jz .L__x_is_zero
+
+ cmp .L__neg_one(%rip), %rdx
+ je .L__x_is_neg_one
+
+ mov %rdx, %r9
+ and .L__exp_mant_mask(%rip), %r9
+ cmp .L__pos_one(%rip), %r9
+ jl .L__ax_lt1_y_is_large_or_inf_or_nan
+
+ jmp .L__ax_gt1_y_is_large_or_inf_or_nan
+
+.p2align 4,,15
+.L__x_is_zero:
+ mov .L__exp_mask(%rip), %r10
+ xor %rax, %rax
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ je .L__x_is_zero_y_is_inf_or_nan
+
+ mov .L__sign_mask(%rip), %r10
+ and %r8, %r10
+ cmovnz .L__pos_inf(%rip), %rax
+ jnz .L__x_is_zero_z_is_inf
+
+ movd %rax, %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_zero_z_is_inf:
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %rax, %xmm2
+ orpd negate_result(%rsp), %xmm2
+ mov .L__flag_x_zero_z_inf(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_zero_y_is_inf_or_nan:
+ mov %r8, %r11
+ cmp .L__neg_inf(%rip), %r8
+ cmove .L__pos_inf(%rip), %rax
+ je .L__x_is_zero_z_is_inf
+
+ or .L__qnan_set(%rip), %r11
+ mov .L__mant_mask(%rip), %r10
+ and %r8, %r10
+ jnz .L__y_is_nan
+
+ movd %rax, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+ xor %r11, %r11
+ mov .L__sign_mask(%rip), %r10
+ and %r8, %r10
+ cmovz .L__pos_inf(%rip), %r11
+ mov %rdx, %rax
+ mov .L__mant_mask(%rip), %r9
+ or .L__qnan_set(%rip), %rax
+ and %rdx, %r9
+ cmovnz %rax, %r11
+ jnz .L__x_is_nan
+
+ xor %rax, %rax
+ mov %r8, %r9
+ mov .L__exp_mask(%rip), %r10
+ or .L__qnan_set(%rip), %r9
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ cmovnz %r9, %r11
+ jnz .L__y_is_nan
+
+ movd %r11, %xmm0
+ orpd negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__ay_is_very_small:
+ movsd .L__pos_one(%rip), %xmm0
+ addsd %xmm1, %xmm0
+ jmp .L__final_check
+
+
+.p2align 4,,15
+.L__ax_lt1_y_is_large_or_inf_or_nan:
+ xor %r11, %r11
+ mov .L__sign_mask(%rip), %r10
+ and %r8, %r10
+ cmovnz .L__pos_inf(%rip), %r11
+ jmp .L__adjust_for_nan
+
+.p2align 4,,15
+.L__ax_gt1_y_is_large_or_inf_or_nan:
+ xor %r11, %r11
+ mov .L__sign_mask(%rip), %r10
+ and %r8, %r10
+ cmovz .L__pos_inf(%rip), %r11
+
+.p2align 4,,15
+.L__adjust_for_nan:
+
+ xor %rax, %rax
+ mov %r8, %r9
+ mov .L__exp_mask(%rip), %r10
+ or .L__qnan_set(%rip), %r9
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ cmovnz %r9, %r11
+ jnz .L__y_is_nan
+
+ test %rax, %rax
+ jnz .L__y_is_inf
+
+.p2align 4,,15
+.L__z_is_zero_or_inf:
+
+ mov .L__flag_z_zero(%rip), %edi
+ test %r11, %r11
+ cmovnz .L__flag_z_inf(%rip), %edi
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %r11, %xmm2
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_inf:
+
+ movd %r11, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_nan:
+
+ xor %rax, %rax
+ mov .L__exp_mask(%rip), %r10
+ and %r8, %r10
+ cmp .L__exp_mask(%rip), %r10
+ cmove %r8, %rax
+ mov .L__mant_mask(%rip), %r10
+ and %rax, %r10
+ jnz .L__x_is_nan_y_is_nan
+
+ mov .L__qnan_set(%rip), %r9
+ and %rdx, %r9
+ movd %r11, %xmm0
+ jnz .L__final_check
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %r11, %xmm2
+ mov .L__flag_x_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_nan:
+
+ mov .L__qnan_set(%rip), %r10
+ and %r8, %r10
+ movd %r11, %xmm0
+ jnz .L__final_check
+
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %r11, %xmm2
+ mov .L__flag_y_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_nan_y_is_nan:
+
+ mov .L__qnan_set(%rip), %r9
+ and %rdx, %r9
+ jz .L__continue_xy_nan
+
+ mov .L__qnan_set(%rip), %r10
+ and %r8, %r10
+ jz .L__continue_xy_nan
+
+ movd %r11, %xmm0
+ jmp .L__final_check
+
+.L__continue_xy_nan:
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ movd %r11, %xmm2
+ mov .L__flag_x_nan_y_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__z_denormal:
+
+ movsd %xmm0, %xmm2
+ movsd save_x(%rsp), %xmm0
+ movsd save_y(%rsp), %xmm1
+ mov .L__flag_z_denormal(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+
+.data
+.align 16
+
+# these flag codes must match the ones in the corresponding .c file
+.L__flag_x_one_y_snan: .long 1
+.L__flag_x_zero_z_inf: .long 2
+.L__flag_x_nan: .long 3
+.L__flag_y_nan: .long 4
+.L__flag_x_nan_y_nan: .long 5
+.L__flag_x_neg_y_notint: .long 6
+.L__flag_z_zero: .long 7
+.L__flag_z_denormal: .long 8
+.L__flag_z_inf: .long 9
+
+.align 16
+
+.L__ay_max_bound: .quad 0x43e0000000000000
+.L__ay_min_bound: .quad 0x3c00000000000000
+.L__sign_mask: .quad 0x8000000000000000
+.L__sign_and_exp_mask: .quad 0x0fff0000000000000
+.L__exp_mask: .quad 0x7ff0000000000000
+.L__neg_inf: .quad 0x0fff0000000000000
+.L__pos_inf: .quad 0x7ff0000000000000
+.L__pos_one: .quad 0x3ff0000000000000
+.L__pos_zero: .quad 0x0000000000000000
+.L__exp_mant_mask: .quad 0x7fffffffffffffff
+.L__mant_mask: .quad 0x000fffffffffffff
+.L__ind_pattern: .quad 0x0fff8000000000000
+
+.L__neg_qnan: .quad 0x0fff8000000000000
+.L__qnan: .quad 0x7ff8000000000000
+.L__qnan_set: .quad 0x0008000000000000
+
+.L__neg_one: .quad 0x0bff0000000000000
+.L__neg_zero: .quad 0x8000000000000000
+
+.L__exp_shift: .quad 0x0000000000000034 # 52
+.L__exp_bias: .quad 0x00000000000003ff # 1023
+.L__exp_bias_m1: .quad 0x00000000000003fe # 1022
+
+.L__yexp_53: .quad 0x0000000000000035 # 53
+.L__mant_full: .quad 0x000fffffffffffff
+.L__1_before_mant: .quad 0x0010000000000000
+
+.L__mask_mant_all8: .quad 0x000ff00000000000
+.L__mask_mant9: .quad 0x0000080000000000
+
+.align 16
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000
+ .quad 0x0fffffffff8000000
+
+.L__mask_8000000000000000: .quad 0x8000000000000000
+ .quad 0x8000000000000000
+
+.L__real_4090040000000000: .quad 0x4090040000000000
+ .quad 0x4090040000000000
+
+.L__real_C090C80000000000: .quad 0x0C090C80000000000
+ .quad 0x0C090C80000000000
+
+#---------------------
+# log data
+#---------------------
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_nan: .quad 0x7ff8000000000000 # NaN
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+.L__mask_001: .quad 0x0000000000000001
+ .quad 0x0000000000000000
+
+
+.L__real_log2_lead: .quad 0x3fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x0000000000000000
+.L__real_log2_tail: .quad 0x3e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x0000000000000000
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__mask_100: .quad 0x0000000000000100
+ .quad 0x0000000000000000
+
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_5: .quad 0x3fc999999999999a
+ .quad 0x0000000000000000
+.L__real_1_over_6: .quad 0x3fc5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_7: .quad 0x3fc2492492492494
+ .quad 0x0000000000000000
+
+.L__mask_1023_f: .quad 0x0c08ff80000000000
+ .quad 0x0000000000000000
+
+.L__mask_2045: .quad 0x00000000000007fd
+ .quad 0x0000000000000000
+
+.L__real_threshold: .quad 0x3fc0000000000000 # 0.125
+ .quad 0x3fc0000000000000
+
+.L__real_notsign: .quad 0x7ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x0000000000000000
+
+
+.align 16
+.L__log_256_lead:
+ .quad 0x0000000000000000
+ .quad 0x3f6ff00aa0000000
+ .quad 0x3f7fe02a60000000
+ .quad 0x3f87dc4750000000
+ .quad 0x3f8fc0a8b0000000
+ .quad 0x3f93cea440000000
+ .quad 0x3f97b91b00000000
+ .quad 0x3f9b9fc020000000
+ .quad 0x3f9f829b00000000
+ .quad 0x3fa1b0d980000000
+ .quad 0x3fa39e87b0000000
+ .quad 0x3fa58a5ba0000000
+ .quad 0x3fa77458f0000000
+ .quad 0x3fa95c8300000000
+ .quad 0x3fab42dd70000000
+ .quad 0x3fad276b80000000
+ .quad 0x3faf0a30c0000000
+ .quad 0x3fb0759830000000
+ .quad 0x3fb16536e0000000
+ .quad 0x3fb253f620000000
+ .quad 0x3fb341d790000000
+ .quad 0x3fb42edcb0000000
+ .quad 0x3fb51b0730000000
+ .quad 0x3fb60658a0000000
+ .quad 0x3fb6f0d280000000
+ .quad 0x3fb7da7660000000
+ .quad 0x3fb8c345d0000000
+ .quad 0x3fb9ab4240000000
+ .quad 0x3fba926d30000000
+ .quad 0x3fbb78c820000000
+ .quad 0x3fbc5e5480000000
+ .quad 0x3fbd4313d0000000
+ .quad 0x3fbe270760000000
+ .quad 0x3fbf0a30c0000000
+ .quad 0x3fbfec9130000000
+ .quad 0x3fc0671510000000
+ .quad 0x3fc0d77e70000000
+ .quad 0x3fc1478580000000
+ .quad 0x3fc1b72ad0000000
+ .quad 0x3fc2266f10000000
+ .quad 0x3fc29552f0000000
+ .quad 0x3fc303d710000000
+ .quad 0x3fc371fc20000000
+ .quad 0x3fc3dfc2b0000000
+ .quad 0x3fc44d2b60000000
+ .quad 0x3fc4ba36f0000000
+ .quad 0x3fc526e5e0000000
+ .quad 0x3fc59338d0000000
+ .quad 0x3fc5ff3070000000
+ .quad 0x3fc66acd40000000
+ .quad 0x3fc6d60fe0000000
+ .quad 0x3fc740f8f0000000
+ .quad 0x3fc7ab8900000000
+ .quad 0x3fc815c0a0000000
+ .quad 0x3fc87fa060000000
+ .quad 0x3fc8e928d0000000
+ .quad 0x3fc9525a90000000
+ .quad 0x3fc9bb3620000000
+ .quad 0x3fca23bc10000000
+ .quad 0x3fca8becf0000000
+ .quad 0x3fcaf3c940000000
+ .quad 0x3fcb5b5190000000
+ .quad 0x3fcbc28670000000
+ .quad 0x3fcc296850000000
+ .quad 0x3fcc8ff7c0000000
+ .quad 0x3fccf63540000000
+ .quad 0x3fcd5c2160000000
+ .quad 0x3fcdc1bca0000000
+ .quad 0x3fce270760000000
+ .quad 0x3fce8c0250000000
+ .quad 0x3fcef0adc0000000
+ .quad 0x3fcf550a50000000
+ .quad 0x3fcfb91860000000
+ .quad 0x3fd00e6c40000000
+ .quad 0x3fd0402590000000
+ .quad 0x3fd071b850000000
+ .quad 0x3fd0a324e0000000
+ .quad 0x3fd0d46b50000000
+ .quad 0x3fd1058bf0000000
+ .quad 0x3fd1368700000000
+ .quad 0x3fd1675ca0000000
+ .quad 0x3fd1980d20000000
+ .quad 0x3fd1c898c0000000
+ .quad 0x3fd1f8ff90000000
+ .quad 0x3fd22941f0000000
+ .quad 0x3fd2596010000000
+ .quad 0x3fd2895a10000000
+ .quad 0x3fd2b93030000000
+ .quad 0x3fd2e8e2b0000000
+ .quad 0x3fd31871c0000000
+ .quad 0x3fd347dd90000000
+ .quad 0x3fd3772660000000
+ .quad 0x3fd3a64c50000000
+ .quad 0x3fd3d54fa0000000
+ .quad 0x3fd4043080000000
+ .quad 0x3fd432ef20000000
+ .quad 0x3fd4618bc0000000
+ .quad 0x3fd4900680000000
+ .quad 0x3fd4be5f90000000
+ .quad 0x3fd4ec9730000000
+ .quad 0x3fd51aad80000000
+ .quad 0x3fd548a2c0000000
+ .quad 0x3fd5767710000000
+ .quad 0x3fd5a42ab0000000
+ .quad 0x3fd5d1bdb0000000
+ .quad 0x3fd5ff3070000000
+ .quad 0x3fd62c82f0000000
+ .quad 0x3fd659b570000000
+ .quad 0x3fd686c810000000
+ .quad 0x3fd6b3bb20000000
+ .quad 0x3fd6e08ea0000000
+ .quad 0x3fd70d42e0000000
+ .quad 0x3fd739d7f0000000
+ .quad 0x3fd7664e10000000
+ .quad 0x3fd792a550000000
+ .quad 0x3fd7bede00000000
+ .quad 0x3fd7eaf830000000
+ .quad 0x3fd816f410000000
+ .quad 0x3fd842d1d0000000
+ .quad 0x3fd86e9190000000
+ .quad 0x3fd89a3380000000
+ .quad 0x3fd8c5b7c0000000
+ .quad 0x3fd8f11e80000000
+ .quad 0x3fd91c67e0000000
+ .quad 0x3fd9479410000000
+ .quad 0x3fd972a340000000
+ .quad 0x3fd99d9580000000
+ .quad 0x3fd9c86b00000000
+ .quad 0x3fd9f323e0000000
+ .quad 0x3fda1dc060000000
+ .quad 0x3fda484090000000
+ .quad 0x3fda72a490000000
+ .quad 0x3fda9cec90000000
+ .quad 0x3fdac718c0000000
+ .quad 0x3fdaf12930000000
+ .quad 0x3fdb1b1e00000000
+ .quad 0x3fdb44f770000000
+ .quad 0x3fdb6eb590000000
+ .quad 0x3fdb985890000000
+ .quad 0x3fdbc1e080000000
+ .quad 0x3fdbeb4d90000000
+ .quad 0x3fdc149ff0000000
+ .quad 0x3fdc3dd7a0000000
+ .quad 0x3fdc66f4e0000000
+ .quad 0x3fdc8ff7c0000000
+ .quad 0x3fdcb8e070000000
+ .quad 0x3fdce1af00000000
+ .quad 0x3fdd0a63a0000000
+ .quad 0x3fdd32fe70000000
+ .quad 0x3fdd5b7f90000000
+ .quad 0x3fdd83e720000000
+ .quad 0x3fddac3530000000
+ .quad 0x3fddd46a00000000
+ .quad 0x3fddfc8590000000
+ .quad 0x3fde248810000000
+ .quad 0x3fde4c71a0000000
+ .quad 0x3fde744260000000
+ .quad 0x3fde9bfa60000000
+ .quad 0x3fdec399d0000000
+ .quad 0x3fdeeb20c0000000
+ .quad 0x3fdf128f50000000
+ .quad 0x3fdf39e5b0000000
+ .quad 0x3fdf6123f0000000
+ .quad 0x3fdf884a30000000
+ .quad 0x3fdfaf5880000000
+ .quad 0x3fdfd64f20000000
+ .quad 0x3fdffd2e00000000
+ .quad 0x3fe011fab0000000
+ .quad 0x3fe02552a0000000
+ .quad 0x3fe0389ee0000000
+ .quad 0x3fe04bdf90000000
+ .quad 0x3fe05f14b0000000
+ .quad 0x3fe0723e50000000
+ .quad 0x3fe0855c80000000
+ .quad 0x3fe0986f40000000
+ .quad 0x3fe0ab76b0000000
+ .quad 0x3fe0be72e0000000
+ .quad 0x3fe0d163c0000000
+ .quad 0x3fe0e44980000000
+ .quad 0x3fe0f72410000000
+ .quad 0x3fe109f390000000
+ .quad 0x3fe11cb810000000
+ .quad 0x3fe12f7190000000
+ .quad 0x3fe1422020000000
+ .quad 0x3fe154c3d0000000
+ .quad 0x3fe1675ca0000000
+ .quad 0x3fe179eab0000000
+ .quad 0x3fe18c6e00000000
+ .quad 0x3fe19ee6b0000000
+ .quad 0x3fe1b154b0000000
+ .quad 0x3fe1c3b810000000
+ .quad 0x3fe1d610f0000000
+ .quad 0x3fe1e85f50000000
+ .quad 0x3fe1faa340000000
+ .quad 0x3fe20cdcd0000000
+ .quad 0x3fe21f0bf0000000
+ .quad 0x3fe23130d0000000
+ .quad 0x3fe2434b60000000
+ .quad 0x3fe2555bc0000000
+ .quad 0x3fe2676200000000
+ .quad 0x3fe2795e10000000
+ .quad 0x3fe28b5000000000
+ .quad 0x3fe29d37f0000000
+ .quad 0x3fe2af15f0000000
+ .quad 0x3fe2c0e9e0000000
+ .quad 0x3fe2d2b400000000
+ .quad 0x3fe2e47430000000
+ .quad 0x3fe2f62a90000000
+ .quad 0x3fe307d730000000
+ .quad 0x3fe3197a00000000
+ .quad 0x3fe32b1330000000
+ .quad 0x3fe33ca2b0000000
+ .quad 0x3fe34e2890000000
+ .quad 0x3fe35fa4e0000000
+ .quad 0x3fe37117b0000000
+ .quad 0x3fe38280f0000000
+ .quad 0x3fe393e0d0000000
+ .quad 0x3fe3a53730000000
+ .quad 0x3fe3b68440000000
+ .quad 0x3fe3c7c7f0000000
+ .quad 0x3fe3d90260000000
+ .quad 0x3fe3ea3390000000
+ .quad 0x3fe3fb5b80000000
+ .quad 0x3fe40c7a40000000
+ .quad 0x3fe41d8fe0000000
+ .quad 0x3fe42e9c60000000
+ .quad 0x3fe43f9fe0000000
+ .quad 0x3fe4509a50000000
+ .quad 0x3fe4618bc0000000
+ .quad 0x3fe4727430000000
+ .quad 0x3fe48353d0000000
+ .quad 0x3fe4942a80000000
+ .quad 0x3fe4a4f850000000
+ .quad 0x3fe4b5bd60000000
+ .quad 0x3fe4c679a0000000
+ .quad 0x3fe4d72d30000000
+ .quad 0x3fe4e7d810000000
+ .quad 0x3fe4f87a30000000
+ .quad 0x3fe50913c0000000
+ .quad 0x3fe519a4c0000000
+ .quad 0x3fe52a2d20000000
+ .quad 0x3fe53aad00000000
+ .quad 0x3fe54b2460000000
+ .quad 0x3fe55b9350000000
+ .quad 0x3fe56bf9d0000000
+ .quad 0x3fe57c57f0000000
+ .quad 0x3fe58cadb0000000
+ .quad 0x3fe59cfb20000000
+ .quad 0x3fe5ad4040000000
+ .quad 0x3fe5bd7d30000000
+ .quad 0x3fe5cdb1d0000000
+ .quad 0x3fe5ddde50000000
+ .quad 0x3fe5ee02a0000000
+ .quad 0x3fe5fe1ed0000000
+ .quad 0x3fe60e32f0000000
+ .quad 0x3fe61e3ef0000000
+ .quad 0x3fe62e42e0000000
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_256_tail:
+ .quad 0x0000000000000000
+ .quad 0x3db5885e0250435a
+ .quad 0x3de620cf11f86ed2
+ .quad 0x3dff0214edba4a25
+ .quad 0x3dbf807c79f3db4e
+ .quad 0x3dea352ba779a52b
+ .quad 0x3dff56c46aa49fd5
+ .quad 0x3dfebe465fef5196
+ .quad 0x3e0cf0660099f1f8
+ .quad 0x3e1247b2ff85945d
+ .quad 0x3e13fd7abf5202b6
+ .quad 0x3e1f91c9a918d51e
+ .quad 0x3e08cb73f118d3ca
+ .quad 0x3e1d91c7d6fad074
+ .quad 0x3de1971bec28d14c
+ .quad 0x3e15b616a423c78a
+ .quad 0x3da162a6617cc971
+ .quad 0x3e166391c4c06d29
+ .quad 0x3e2d46f5c1d0c4b8
+ .quad 0x3e2e14282df1f6d3
+ .quad 0x3e186f47424a660d
+ .quad 0x3e2d4c8de077753e
+ .quad 0x3e2e0c307ed24f1c
+ .quad 0x3e226ea18763bdd3
+ .quad 0x3e25cad69737c933
+ .quad 0x3e2af62599088901
+ .quad 0x3e18c66c83d6b2d0
+ .quad 0x3e1880ceb36fb30f
+ .quad 0x3e2495aac6ca17a4
+ .quad 0x3e2761db4210878c
+ .quad 0x3e2eb78e862bac2f
+ .quad 0x3e19b2cd75790dd9
+ .quad 0x3e2c55e5cbd3d50f
+ .quad 0x3db162a6617cc971
+ .quad 0x3dfdbeabaaa2e519
+ .quad 0x3e1652cb7150c647
+ .quad 0x3e39a11cb2cd2ee2
+ .quad 0x3e219d0ab1a28813
+ .quad 0x3e24bd9e80a41811
+ .quad 0x3e3214b596faa3df
+ .quad 0x3e303fea46980bb8
+ .quad 0x3e31c8ffa5fd28c7
+ .quad 0x3dce8f743bcd96c5
+ .quad 0x3dfd98c5395315c6
+ .quad 0x3e3996fa3ccfa7b2
+ .quad 0x3e1cd2af2ad13037
+ .quad 0x3e1d0da1bd17200e
+ .quad 0x3e3330410ba68b75
+ .quad 0x3df4f27a790e7c41
+ .quad 0x3e13956a86f6ff1b
+ .quad 0x3e2c6748723551d9
+ .quad 0x3e2500de9326cdfc
+ .quad 0x3e1086c848df1b59
+ .quad 0x3e04357ead6836ff
+ .quad 0x3e24832442408024
+ .quad 0x3e3d10da8154b13d
+ .quad 0x3e39e8ad68ec8260
+ .quad 0x3e3cfbf706abaf18
+ .quad 0x3e3fc56ac6326e23
+ .quad 0x3e39105e3185cf21
+ .quad 0x3e3d017fe5b19cc0
+ .quad 0x3e3d1f6b48dd13fe
+ .quad 0x3e20b63358a7e73a
+ .quad 0x3e263063028c211c
+ .quad 0x3e2e6a6886b09760
+ .quad 0x3e3c138bb891cd03
+ .quad 0x3e369f7722b7221a
+ .quad 0x3df57d8fac1a628c
+ .quad 0x3e3c55e5cbd3d50f
+ .quad 0x3e1552d2ff48fe2e
+ .quad 0x3e37b8b26ca431bc
+ .quad 0x3e292decdc1c5f6d
+ .quad 0x3e3abc7c551aaa8c
+ .quad 0x3e36b540731a354b
+ .quad 0x3e32d341036b89ef
+ .quad 0x3e4f9ab21a3a2e0f
+ .quad 0x3e239c871afb9fbd
+ .quad 0x3e3e6add2c81f640
+ .quad 0x3e435c95aa313f41
+ .quad 0x3e249d4582f6cc53
+ .quad 0x3e47574c1c07398f
+ .quad 0x3e4ba846dece9e8d
+ .quad 0x3e16999fafbc68e7
+ .quad 0x3e4c9145e51b0103
+ .quad 0x3e479ef2cb44850a
+ .quad 0x3e0beec73de11275
+ .quad 0x3e2ef4351af5a498
+ .quad 0x3e45713a493b4a50
+ .quad 0x3e45c23a61385992
+ .quad 0x3e42a88309f57299
+ .quad 0x3e4530faa9ac8ace
+ .quad 0x3e25fec2d792a758
+ .quad 0x3e35a517a71cbcd7
+ .quad 0x3e3707dc3e1cd9a3
+ .quad 0x3e3a1a9f8ef43049
+ .quad 0x3e4409d0276b3674
+ .quad 0x3e20e2f613e85bd9
+ .quad 0x3df0027433001e5f
+ .quad 0x3e35dde2836d3265
+ .quad 0x3e2300134d7aaf04
+ .quad 0x3e3cb7e0b42724f5
+ .quad 0x3e2d6e93167e6308
+ .quad 0x3e3d1569b1526adb
+ .quad 0x3e0e99fc338a1a41
+ .quad 0x3e4eb01394a11b1c
+ .quad 0x3e04f27a790e7c41
+ .quad 0x3e25ce3ca97b7af9
+ .quad 0x3e281f0f940ed857
+ .quad 0x3e4d36295d88857c
+ .quad 0x3e21aca1ec4af526
+ .quad 0x3e445743c7182726
+ .quad 0x3e23c491aead337e
+ .quad 0x3e3aef401a738931
+ .quad 0x3e21cede76092a29
+ .quad 0x3e4fba8f44f82bb4
+ .quad 0x3e446f5f7f3c3e1a
+ .quad 0x3e47055f86c9674b
+ .quad 0x3e4b41a92b6b6e1a
+ .quad 0x3e443d162e927628
+ .quad 0x3e4466174013f9b1
+ .quad 0x3e3b05096ad69c62
+ .quad 0x3e40b169150faa58
+ .quad 0x3e3cd98b1df85da7
+ .quad 0x3e468b507b0f8fa8
+ .quad 0x3e48422df57499ba
+ .quad 0x3e11351586970274
+ .quad 0x3e117e08acba92ee
+ .quad 0x3e26e04314dd0229
+ .quad 0x3e497f3097e56d1a
+ .quad 0x3e3356e655901286
+ .quad 0x3e0cb761457f94d6
+ .quad 0x3e39af67a85a9dac
+ .quad 0x3e453410931a909f
+ .quad 0x3e22c587206058f5
+ .quad 0x3e223bc358899c22
+ .quad 0x3e4d7bf8b6d223cb
+ .quad 0x3e47991ec5197ddb
+ .quad 0x3e4a79e6bb3a9219
+ .quad 0x3e3a4c43ed663ec5
+ .quad 0x3e461b5a1484f438
+ .quad 0x3e4b4e36f7ef0c3a
+ .quad 0x3e115f026acd0d1b
+ .quad 0x3e3f36b535cecf05
+ .quad 0x3e2ffb7fbf3eb5c6
+ .quad 0x3e3e6a6886b09760
+ .quad 0x3e3135eb27f5bbc3
+ .quad 0x3e470be7d6f6fa57
+ .quad 0x3e4ce43cc84ab338
+ .quad 0x3e4c01d7aac3bd91
+ .quad 0x3e45c58d07961060
+ .quad 0x3e3628bcf941456e
+ .quad 0x3e4c58b2a8461cd2
+ .quad 0x3e33071282fb989a
+ .quad 0x3e420dab6a80f09c
+ .quad 0x3e44f8d84c397b1e
+ .quad 0x3e40d0ee08599e48
+ .quad 0x3e1d68787e37da36
+ .quad 0x3e366187d591bafc
+ .quad 0x3e22346600bae772
+ .quad 0x3e390377d0d61b8e
+ .quad 0x3e4f5e0dd966b907
+ .quad 0x3e49023cb79a00e2
+ .quad 0x3e44e05158c28ad8
+ .quad 0x3e3bfa7b08b18ae4
+ .quad 0x3e4ef1e63db35f67
+ .quad 0x3e0ec2ae39493d4f
+ .quad 0x3e40afe930ab2fa0
+ .quad 0x3e225ff8a1810dd4
+ .quad 0x3e469743fb1a71a5
+ .quad 0x3e5f9cc676785571
+ .quad 0x3e5b524da4cbf982
+ .quad 0x3e5a4c8b381535b8
+ .quad 0x3e5839be809caf2c
+ .quad 0x3e50968a1cb82c13
+ .quad 0x3e5eae6a41723fb5
+ .quad 0x3e5d9c29a380a4db
+ .quad 0x3e4094aa0ada625e
+ .quad 0x3e5973ad6fc108ca
+ .quad 0x3e4747322fdbab97
+ .quad 0x3e593692fa9d4221
+ .quad 0x3e5c5a992dfbc7d9
+ .quad 0x3e4e1f33e102387a
+ .quad 0x3e464fbef14c048c
+ .quad 0x3e4490f513ca5e3b
+ .quad 0x3e37a6af4d4c799d
+ .quad 0x3e57574c1c07398f
+ .quad 0x3e57b133417f8c1c
+ .quad 0x3e5feb9e0c176514
+ .quad 0x3e419f25bb3172f7
+ .quad 0x3e45f68a7bbfb852
+ .quad 0x3e5ee278497929f1
+ .quad 0x3e5ccee006109d58
+ .quad 0x3e5ce081a07bd8b3
+ .quad 0x3e570e12981817b8
+ .quad 0x3e292ab6d93503d0
+ .quad 0x3e58cb7dd7c3b61e
+ .quad 0x3e4efafd0a0b78da
+ .quad 0x3e5e907267c4288e
+ .quad 0x3e5d31ef96780875
+ .quad 0x3e23430dfcd2ad50
+ .quad 0x3e344d88d75bc1f9
+ .quad 0x3e5bec0f055e04fc
+ .quad 0x3e5d85611590b9ad
+ .quad 0x3df320568e583229
+ .quad 0x3e5a891d1772f538
+ .quad 0x3e22edc9dabba74d
+ .quad 0x3e4b9009a1015086
+ .quad 0x3e52a12a8c5b1a19
+ .quad 0x3e3a7885f0fdac85
+ .quad 0x3e5f4ffcd43ac691
+ .quad 0x3e52243ae2640aad
+ .quad 0x3e546513299035d3
+ .quad 0x3e5b39c3a62dd725
+ .quad 0x3e5ba6dd40049f51
+ .quad 0x3e451d1ed7177409
+ .quad 0x3e5cb0f2fd7f5216
+ .quad 0x3e3ab150cd4e2213
+ .quad 0x3e5cfd7bf3193844
+ .quad 0x3e53fff8455f1dbd
+ .quad 0x3e5fee640b905fc9
+ .quad 0x3e54e2adf548084c
+ .quad 0x3e3b597adc1ecdd2
+ .quad 0x3e4345bd096d3a75
+ .quad 0x3e5101b9d2453c8b
+ .quad 0x3e508ce55cc8c979
+ .quad 0x3e5bbf017e595f71
+ .quad 0x3e37ce733bd393dc
+ .quad 0x3e233bb0a503f8a1
+ .quad 0x3e30e2f613e85bd9
+ .quad 0x3e5e67555a635b3c
+ .quad 0x3e2ea88df73d5e8b
+ .quad 0x3e3d17e03bda18a8
+ .quad 0x3e5b607d76044f7e
+ .quad 0x3e52adc4e71bc2fc
+ .quad 0x3e5f99dc7362d1d9
+ .quad 0x3e5473fa008e6a6a
+ .quad 0x3e2b75bb09cb0985
+ .quad 0x3e5ea04dd10b9aba
+ .quad 0x3e5802d0d6979674
+ .quad 0x3e174688ccd99094
+ .quad 0x3e496f16abb9df22
+ .quad 0x3e46e66df2aa374f
+ .quad 0x3e4e66525ea4550a
+ .quad 0x3e42d02f34f20cbd
+ .quad 0x3e46cfce65047188
+ .quad 0x3e39b78c842d58b8
+ .quad 0x3e4735e624c24bc9
+ .quad 0x3e47eba1f7dd1adf
+ .quad 0x3e586b3e59f65355
+ .quad 0x3e1ce38e637f1b4d
+ .quad 0x3e58d82ec919edc7
+ .quad 0x3e4c52648ddcfa37
+ .quad 0x3e52482ceae1ac12
+ .quad 0x3e55a312311aba4f
+ .quad 0x3e411e236329f225
+ .quad 0x3e5b48c8cd2f246c
+ .quad 0x3e6efa39ef35793c
+ .quad 0x0000000000000000
+
+.align 16
+.L__log_F_inv_head:
+ .quad 0x4000000000000000
+ .quad 0x3fffe00000000000
+ .quad 0x3fffc00000000000
+ .quad 0x3fffa00000000000
+ .quad 0x3fff800000000000
+ .quad 0x3fff600000000000
+ .quad 0x3fff400000000000
+ .quad 0x3fff200000000000
+ .quad 0x3fff000000000000
+ .quad 0x3ffee00000000000
+ .quad 0x3ffec00000000000
+ .quad 0x3ffea00000000000
+ .quad 0x3ffe900000000000
+ .quad 0x3ffe700000000000
+ .quad 0x3ffe500000000000
+ .quad 0x3ffe300000000000
+ .quad 0x3ffe100000000000
+ .quad 0x3ffe000000000000
+ .quad 0x3ffde00000000000
+ .quad 0x3ffdc00000000000
+ .quad 0x3ffda00000000000
+ .quad 0x3ffd900000000000
+ .quad 0x3ffd700000000000
+ .quad 0x3ffd500000000000
+ .quad 0x3ffd400000000000
+ .quad 0x3ffd200000000000
+ .quad 0x3ffd000000000000
+ .quad 0x3ffcf00000000000
+ .quad 0x3ffcd00000000000
+ .quad 0x3ffcb00000000000
+ .quad 0x3ffca00000000000
+ .quad 0x3ffc800000000000
+ .quad 0x3ffc700000000000
+ .quad 0x3ffc500000000000
+ .quad 0x3ffc300000000000
+ .quad 0x3ffc200000000000
+ .quad 0x3ffc000000000000
+ .quad 0x3ffbf00000000000
+ .quad 0x3ffbd00000000000
+ .quad 0x3ffbc00000000000
+ .quad 0x3ffba00000000000
+ .quad 0x3ffb900000000000
+ .quad 0x3ffb700000000000
+ .quad 0x3ffb600000000000
+ .quad 0x3ffb400000000000
+ .quad 0x3ffb300000000000
+ .quad 0x3ffb200000000000
+ .quad 0x3ffb000000000000
+ .quad 0x3ffaf00000000000
+ .quad 0x3ffad00000000000
+ .quad 0x3ffac00000000000
+ .quad 0x3ffaa00000000000
+ .quad 0x3ffa900000000000
+ .quad 0x3ffa800000000000
+ .quad 0x3ffa600000000000
+ .quad 0x3ffa500000000000
+ .quad 0x3ffa400000000000
+ .quad 0x3ffa200000000000
+ .quad 0x3ffa100000000000
+ .quad 0x3ffa000000000000
+ .quad 0x3ff9e00000000000
+ .quad 0x3ff9d00000000000
+ .quad 0x3ff9c00000000000
+ .quad 0x3ff9a00000000000
+ .quad 0x3ff9900000000000
+ .quad 0x3ff9800000000000
+ .quad 0x3ff9700000000000
+ .quad 0x3ff9500000000000
+ .quad 0x3ff9400000000000
+ .quad 0x3ff9300000000000
+ .quad 0x3ff9200000000000
+ .quad 0x3ff9000000000000
+ .quad 0x3ff8f00000000000
+ .quad 0x3ff8e00000000000
+ .quad 0x3ff8d00000000000
+ .quad 0x3ff8b00000000000
+ .quad 0x3ff8a00000000000
+ .quad 0x3ff8900000000000
+ .quad 0x3ff8800000000000
+ .quad 0x3ff8700000000000
+ .quad 0x3ff8600000000000
+ .quad 0x3ff8400000000000
+ .quad 0x3ff8300000000000
+ .quad 0x3ff8200000000000
+ .quad 0x3ff8100000000000
+ .quad 0x3ff8000000000000
+ .quad 0x3ff7f00000000000
+ .quad 0x3ff7e00000000000
+ .quad 0x3ff7d00000000000
+ .quad 0x3ff7b00000000000
+ .quad 0x3ff7a00000000000
+ .quad 0x3ff7900000000000
+ .quad 0x3ff7800000000000
+ .quad 0x3ff7700000000000
+ .quad 0x3ff7600000000000
+ .quad 0x3ff7500000000000
+ .quad 0x3ff7400000000000
+ .quad 0x3ff7300000000000
+ .quad 0x3ff7200000000000
+ .quad 0x3ff7100000000000
+ .quad 0x3ff7000000000000
+ .quad 0x3ff6f00000000000
+ .quad 0x3ff6e00000000000
+ .quad 0x3ff6d00000000000
+ .quad 0x3ff6c00000000000
+ .quad 0x3ff6b00000000000
+ .quad 0x3ff6a00000000000
+ .quad 0x3ff6900000000000
+ .quad 0x3ff6800000000000
+ .quad 0x3ff6700000000000
+ .quad 0x3ff6600000000000
+ .quad 0x3ff6500000000000
+ .quad 0x3ff6400000000000
+ .quad 0x3ff6300000000000
+ .quad 0x3ff6200000000000
+ .quad 0x3ff6100000000000
+ .quad 0x3ff6000000000000
+ .quad 0x3ff5f00000000000
+ .quad 0x3ff5e00000000000
+ .quad 0x3ff5d00000000000
+ .quad 0x3ff5c00000000000
+ .quad 0x3ff5b00000000000
+ .quad 0x3ff5a00000000000
+ .quad 0x3ff5900000000000
+ .quad 0x3ff5800000000000
+ .quad 0x3ff5800000000000
+ .quad 0x3ff5700000000000
+ .quad 0x3ff5600000000000
+ .quad 0x3ff5500000000000
+ .quad 0x3ff5400000000000
+ .quad 0x3ff5300000000000
+ .quad 0x3ff5200000000000
+ .quad 0x3ff5100000000000
+ .quad 0x3ff5000000000000
+ .quad 0x3ff5000000000000
+ .quad 0x3ff4f00000000000
+ .quad 0x3ff4e00000000000
+ .quad 0x3ff4d00000000000
+ .quad 0x3ff4c00000000000
+ .quad 0x3ff4b00000000000
+ .quad 0x3ff4a00000000000
+ .quad 0x3ff4a00000000000
+ .quad 0x3ff4900000000000
+ .quad 0x3ff4800000000000
+ .quad 0x3ff4700000000000
+ .quad 0x3ff4600000000000
+ .quad 0x3ff4600000000000
+ .quad 0x3ff4500000000000
+ .quad 0x3ff4400000000000
+ .quad 0x3ff4300000000000
+ .quad 0x3ff4200000000000
+ .quad 0x3ff4200000000000
+ .quad 0x3ff4100000000000
+ .quad 0x3ff4000000000000
+ .quad 0x3ff3f00000000000
+ .quad 0x3ff3e00000000000
+ .quad 0x3ff3e00000000000
+ .quad 0x3ff3d00000000000
+ .quad 0x3ff3c00000000000
+ .quad 0x3ff3b00000000000
+ .quad 0x3ff3b00000000000
+ .quad 0x3ff3a00000000000
+ .quad 0x3ff3900000000000
+ .quad 0x3ff3800000000000
+ .quad 0x3ff3800000000000
+ .quad 0x3ff3700000000000
+ .quad 0x3ff3600000000000
+ .quad 0x3ff3500000000000
+ .quad 0x3ff3500000000000
+ .quad 0x3ff3400000000000
+ .quad 0x3ff3300000000000
+ .quad 0x3ff3200000000000
+ .quad 0x3ff3200000000000
+ .quad 0x3ff3100000000000
+ .quad 0x3ff3000000000000
+ .quad 0x3ff3000000000000
+ .quad 0x3ff2f00000000000
+ .quad 0x3ff2e00000000000
+ .quad 0x3ff2e00000000000
+ .quad 0x3ff2d00000000000
+ .quad 0x3ff2c00000000000
+ .quad 0x3ff2b00000000000
+ .quad 0x3ff2b00000000000
+ .quad 0x3ff2a00000000000
+ .quad 0x3ff2900000000000
+ .quad 0x3ff2900000000000
+ .quad 0x3ff2800000000000
+ .quad 0x3ff2700000000000
+ .quad 0x3ff2700000000000
+ .quad 0x3ff2600000000000
+ .quad 0x3ff2500000000000
+ .quad 0x3ff2500000000000
+ .quad 0x3ff2400000000000
+ .quad 0x3ff2300000000000
+ .quad 0x3ff2300000000000
+ .quad 0x3ff2200000000000
+ .quad 0x3ff2100000000000
+ .quad 0x3ff2100000000000
+ .quad 0x3ff2000000000000
+ .quad 0x3ff2000000000000
+ .quad 0x3ff1f00000000000
+ .quad 0x3ff1e00000000000
+ .quad 0x3ff1e00000000000
+ .quad 0x3ff1d00000000000
+ .quad 0x3ff1c00000000000
+ .quad 0x3ff1c00000000000
+ .quad 0x3ff1b00000000000
+ .quad 0x3ff1b00000000000
+ .quad 0x3ff1a00000000000
+ .quad 0x3ff1900000000000
+ .quad 0x3ff1900000000000
+ .quad 0x3ff1800000000000
+ .quad 0x3ff1800000000000
+ .quad 0x3ff1700000000000
+ .quad 0x3ff1600000000000
+ .quad 0x3ff1600000000000
+ .quad 0x3ff1500000000000
+ .quad 0x3ff1500000000000
+ .quad 0x3ff1400000000000
+ .quad 0x3ff1300000000000
+ .quad 0x3ff1300000000000
+ .quad 0x3ff1200000000000
+ .quad 0x3ff1200000000000
+ .quad 0x3ff1100000000000
+ .quad 0x3ff1100000000000
+ .quad 0x3ff1000000000000
+ .quad 0x3ff0f00000000000
+ .quad 0x3ff0f00000000000
+ .quad 0x3ff0e00000000000
+ .quad 0x3ff0e00000000000
+ .quad 0x3ff0d00000000000
+ .quad 0x3ff0d00000000000
+ .quad 0x3ff0c00000000000
+ .quad 0x3ff0c00000000000
+ .quad 0x3ff0b00000000000
+ .quad 0x3ff0a00000000000
+ .quad 0x3ff0a00000000000
+ .quad 0x3ff0900000000000
+ .quad 0x3ff0900000000000
+ .quad 0x3ff0800000000000
+ .quad 0x3ff0800000000000
+ .quad 0x3ff0700000000000
+ .quad 0x3ff0700000000000
+ .quad 0x3ff0600000000000
+ .quad 0x3ff0600000000000
+ .quad 0x3ff0500000000000
+ .quad 0x3ff0500000000000
+ .quad 0x3ff0400000000000
+ .quad 0x3ff0400000000000
+ .quad 0x3ff0300000000000
+ .quad 0x3ff0300000000000
+ .quad 0x3ff0200000000000
+ .quad 0x3ff0200000000000
+ .quad 0x3ff0100000000000
+ .quad 0x3ff0100000000000
+ .quad 0x3ff0000000000000
+ .quad 0x3ff0000000000000
+
+.align 16
+.L__log_F_inv_tail:
+ .quad 0x0000000000000000
+ .quad 0x3effe01fe01fe020
+ .quad 0x3f1fc07f01fc07f0
+ .quad 0x3f31caa01fa11caa
+ .quad 0x3f3f81f81f81f820
+ .quad 0x3f48856506ddaba6
+ .quad 0x3f5196792909c560
+ .quad 0x3f57d9108c2ad433
+ .quad 0x3f5f07c1f07c1f08
+ .quad 0x3f638ff08b1c03dd
+ .quad 0x3f680f6603d980f6
+ .quad 0x3f6d00f57403d5d0
+ .quad 0x3f331abf0b7672a0
+ .quad 0x3f506a965d43919b
+ .quad 0x3f5ceb240795ceb2
+ .quad 0x3f6522f3b834e67f
+ .quad 0x3f6c3c3c3c3c3c3c
+ .quad 0x3f3e01e01e01e01e
+ .quad 0x3f575b8fe21a291c
+ .quad 0x3f6403b9403b9404
+ .quad 0x3f6cc0ed7303b5cc
+ .quad 0x3f479118f3fc4da2
+ .quad 0x3f5ed952e0b0ce46
+ .quad 0x3f695900eae56404
+ .quad 0x3f3d41d41d41d41d
+ .quad 0x3f5cb28ff16c69ae
+ .quad 0x3f696b1edd80e866
+ .quad 0x3f4372e225fe30d9
+ .quad 0x3f60ad12073615a2
+ .quad 0x3f6cdb2c0397cdb3
+ .quad 0x3f52cc157b864407
+ .quad 0x3f664cb5f7148404
+ .quad 0x3f3c71c71c71c71c
+ .quad 0x3f6129a21a930b84
+ .quad 0x3f6f1e0387f1e038
+ .quad 0x3f5ad4e4ba80709b
+ .quad 0x3f6c0e070381c0e0
+ .quad 0x3f560fba1a362bb0
+ .quad 0x3f6a5713280dee96
+ .quad 0x3f53f59620f9ece9
+ .quad 0x3f69f22983759f23
+ .quad 0x3f5478ac63fc8d5c
+ .quad 0x3f6ad87bb4671656
+ .quad 0x3f578b8efbb8148c
+ .quad 0x3f6d0369d0369d03
+ .quad 0x3f5d212b601b3748
+ .quad 0x3f0b2036406c80d9
+ .quad 0x3f629663b24547d1
+ .quad 0x3f4435e50d79435e
+ .quad 0x3f67d0ff2920bc03
+ .quad 0x3f55c06b15c06b16
+ .quad 0x3f6e3a5f0fd7f954
+ .quad 0x3f61dec0d4c77b03
+ .quad 0x3f473289870ac52e
+ .quad 0x3f6a034da034da03
+ .quad 0x3f5d041da2292856
+ .quad 0x3f3a41a41a41a41a
+ .quad 0x3f68550f8a39409d
+ .quad 0x3f5b4fe5e92c0686
+ .quad 0x3f3a01a01a01a01a
+ .quad 0x3f691d2a2067b23a
+ .quad 0x3f5e7c5dada0b4e5
+ .quad 0x3f468a7725080ce1
+ .quad 0x3f6c49d4aa21b490
+ .quad 0x3f63333333333333
+ .quad 0x3f54bc363b03fccf
+ .quad 0x3f2c9f01970e4f81
+ .quad 0x3f697617c6ef5b25
+ .quad 0x3f6161f9add3c0ca
+ .quad 0x3f5319fe6cb39806
+ .quad 0x3f2f693a1c451ab3
+ .quad 0x3f6a9e240321a9e2
+ .quad 0x3f63831f3831f383
+ .quad 0x3f5949ebc4dcfc1c
+ .quad 0x3f480c6980c6980c
+ .quad 0x3f6f9d00c5fe7403
+ .quad 0x3f69721ed7e75347
+ .quad 0x3f6381ec0313381f
+ .quad 0x3f5b97c2aec12653
+ .quad 0x3f509ef3024ae3ba
+ .quad 0x3f38618618618618
+ .quad 0x3f6e0184f00c2780
+ .quad 0x3f692ef5657dba52
+ .quad 0x3f64940305494030
+ .quad 0x3f60303030303030
+ .quad 0x3f58060180601806
+ .quad 0x3f5017f405fd017f
+ .quad 0x3f412a8ad278e8dd
+ .quad 0x3f17d05f417d05f4
+ .quad 0x3f6d67245c02f7d6
+ .quad 0x3f6a4411c1d986a9
+ .quad 0x3f6754d76c7316df
+ .quad 0x3f649902f149902f
+ .quad 0x3f621023358c1a68
+ .quad 0x3f5f7390d2a6c406
+ .quad 0x3f5b2b0805d5b2b1
+ .quad 0x3f5745d1745d1746
+ .quad 0x3f53c31507fa32c4
+ .quad 0x3f50a1fd1b7af017
+ .quad 0x3f4bc36ce3e0453a
+ .quad 0x3f4702e05c0b8170
+ .quad 0x3f4300b79300b793
+ .quad 0x3f3f76b4337c6cb1
+ .quad 0x3f3a62681c860fb0
+ .quad 0x3f36c16c16c16c17
+ .quad 0x3f3490aa31a3cfc7
+ .quad 0x3f33cd153729043e
+ .quad 0x3f3473a88d0bfd2e
+ .quad 0x3f36816816816817
+ .quad 0x3f39f36016719f36
+ .quad 0x3f3ec6a5122f9016
+ .quad 0x3f427c29da5519cf
+ .quad 0x3f4642c8590b2164
+ .quad 0x3f4ab5c45606f00b
+ .quad 0x3f4fd3b80b11fd3c
+ .quad 0x3f52cda0c6ba4eaa
+ .quad 0x3f56058160581606
+ .quad 0x3f5990d0a4b7ef87
+ .quad 0x3f5d6ee340579d6f
+ .quad 0x3f60cf87d9c54a69
+ .quad 0x3f6310572620ae4c
+ .quad 0x3f65798c8ff522a2
+ .quad 0x3f680ad602b580ad
+ .quad 0x3f6ac3e24799546f
+ .quad 0x3f6da46102b1da46
+ .quad 0x3f15805601580560
+ .quad 0x3f3ed3c506b39a23
+ .quad 0x3f4cbdd3e2970f60
+ .quad 0x3f55555555555555
+ .quad 0x3f5c979aee0bf805
+ .quad 0x3f621291e81fd58e
+ .quad 0x3f65fead500a9580
+ .quad 0x3f6a0fd5c5f02a3a
+ .quad 0x3f6e45c223898adc
+ .quad 0x3f35015015015015
+ .quad 0x3f4c7b16ea64d422
+ .quad 0x3f57829cbc14e5e1
+ .quad 0x3f60877db8589720
+ .quad 0x3f65710e4b5edcea
+ .quad 0x3f6a7dbb4d1fc1c8
+ .quad 0x3f6fad40a57eb503
+ .quad 0x3f43fd6bb00a5140
+ .quad 0x3f54e78ecb419ba9
+ .quad 0x3f600a44029100a4
+ .quad 0x3f65c28f5c28f5c3
+ .quad 0x3f6b9c68b2c0cc4a
+ .quad 0x3f2978feb9f34381
+ .quad 0x3f4ecf163bb6500a
+ .quad 0x3f5be1958b67ebb9
+ .quad 0x3f644e6157dc9a3b
+ .quad 0x3f6acc4baa3f0ddf
+ .quad 0x3f26a4cbcb2a247b
+ .quad 0x3f50505050505050
+ .quad 0x3f5e0b4439959819
+ .quad 0x3f66027f6027f602
+ .quad 0x3f6d1e854b5e0db4
+ .quad 0x3f4165e7254813e2
+ .quad 0x3f576646a9d716ef
+ .quad 0x3f632b48f757ce88
+ .quad 0x3f6ac1b24652a906
+ .quad 0x3f33b13b13b13b14
+ .quad 0x3f5490e1eb208984
+ .quad 0x3f62385830fec66e
+ .quad 0x3f6a45a6cc111b7e
+ .quad 0x3f33813813813814
+ .quad 0x3f556f472517b708
+ .quad 0x3f631be7bc0e8f2a
+ .quad 0x3f6b9cbf3e55f044
+ .quad 0x3f40e7d95bc609a9
+ .quad 0x3f59e6b3804d19e7
+ .quad 0x3f65c8b6af7963c2
+ .quad 0x3f6eb9dad43bf402
+ .quad 0x3f4f1a515885fb37
+ .quad 0x3f60eeb1d3d76c02
+ .quad 0x3f6a320261a32026
+ .quad 0x3f3c82ac40260390
+ .quad 0x3f5a12f684bda12f
+ .quad 0x3f669d43fda2962c
+ .quad 0x3f02e025c04b8097
+ .quad 0x3f542804b542804b
+ .quad 0x3f63f69b02593f6a
+ .quad 0x3f6df31cb46e21fa
+ .quad 0x3f5012b404ad012b
+ .quad 0x3f623925e7820a7f
+ .quad 0x3f6c8253c8253c82
+ .quad 0x3f4b92ddc02526e5
+ .quad 0x3f61602511602511
+ .quad 0x3f6bf471439c9adf
+ .quad 0x3f4a85c40939a85c
+ .quad 0x3f6166f9ac024d16
+ .quad 0x3f6c44e10125e227
+ .quad 0x3f4cebf48bbd90e5
+ .quad 0x3f62492492492492
+ .quad 0x3f6d6f2e2ec0b673
+ .quad 0x3f5159e26af37c05
+ .quad 0x3f64024540245402
+ .quad 0x3f6f6f0243f6f024
+ .quad 0x3f55e60121579805
+ .quad 0x3f668e18cf81b10f
+ .quad 0x3f32012012012012
+ .quad 0x3f5c11f7047dc11f
+ .quad 0x3f69e878ff70985e
+ .quad 0x3f4779d9fdc3a219
+ .quad 0x3f61eace5c957907
+ .quad 0x3f6e0d5b450239e1
+ .quad 0x3f548bf073816367
+ .quad 0x3f6694808dda5202
+ .quad 0x3f37c67f2bae2b21
+ .quad 0x3f5ee58469ee5847
+ .quad 0x3f6c0233c0233c02
+ .quad 0x3f514e02328a7012
+ .quad 0x3f6561072057b573
+ .quad 0x3f31811811811812
+ .quad 0x3f5e28646f5a1060
+ .quad 0x3f6c0d1284e6f1d7
+ .quad 0x3f523543f0c80459
+ .quad 0x3f663cbeea4e1a09
+ .quad 0x3f3b9a3fdd5c8cb8
+ .quad 0x3f60be1c159a76d2
+ .quad 0x3f6e1d1a688e4838
+ .quad 0x3f572044d72044d7
+ .quad 0x3f691713db81577b
+ .quad 0x3f4ac73ae9819b50
+ .quad 0x3f6460334e904cf6
+ .quad 0x3f31111111111111
+ .quad 0x3f5feef80441fef0
+ .quad 0x3f6de021fde021fe
+ .quad 0x3f57b7eacc9686a0
+ .quad 0x3f69ead7cd391fbc
+ .quad 0x3f50195609804390
+ .quad 0x3f6641511e8d2b32
+ .quad 0x3f4222b1acf1ce96
+ .quad 0x3f62e29f79b47582
+ .quad 0x3f24f0d1682e11cd
+ .quad 0x3f5f9bb096771e4d
+ .quad 0x3f6e5ee45dd96ae2
+ .quad 0x3f5a0429a0429a04
+ .quad 0x3f6bb74d5f06c021
+ .quad 0x3f54fce404254fce
+ .quad 0x3f695766eacbc402
+ .quad 0x3f50842108421084
+ .quad 0x3f673e5371d5c338
+ .quad 0x3f4930523fbe3368
+ .quad 0x3f656b38f225f6c4
+ .quad 0x3f426e978d4fdf3b
+ .quad 0x3f63dd40e4eb0cc6
+ .quad 0x3f397f7d73404146
+ .quad 0x3f6293982cc98af1
+ .quad 0x3f30410410410410
+ .quad 0x3f618d6f048ff7e4
+ .quad 0x3f2236a3ebc349de
+ .quad 0x3f60c9f8ee53d18c
+ .quad 0x3f10204081020408
+ .quad 0x3f60486ca2f46ea6
+ .quad 0x3ef0101010101010
+ .quad 0x3f60080402010080
+ .quad 0x0000000000000000
+
+#---------------------
+# exp data
+#---------------------
+
+.align 16
+
+.L__denormal_threshold: .long 0x0fffffc02 # -1022
+ .long 0
+ .quad 0
+
+.L__enable_almost_inf: .quad 0x7fe0000000000000
+ .quad 0
+
+.L__real_zero: .quad 0x0000000000000000
+ .quad 0
+
+.L__real_smallest_denormal: .quad 0x0000000000000001
+ .quad 0
+.L__denormal_tiny_threshold: .quad 0x0c0874046dfefd9d0
+ .quad 0
+
+.L__real_p65536: .quad 0x40f0000000000000 # 65536
+ .quad 0
+.L__real_m68800: .quad 0x0c0f0cc0000000000 # -68800
+ .quad 0
+.L__real_64_by_log2: .quad 0x40571547652b82fe # 64/ln(2)
+ .quad 0
+.L__real_log2_by_64_head: .quad 0x3f862e42f0000000 # log2_by_64_head
+ .quad 0
+.L__real_log2_by_64_tail: .quad 0x0bdfdf473de6af278 # -log2_by_64_tail
+ .quad 0
+.L__real_1_by_720: .quad 0x3f56c16c16c16c17 # 1/720
+ .quad 0
+.L__real_1_by_120: .quad 0x3f81111111111111 # 1/120
+ .quad 0
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+ .quad 0
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+ .quad 0
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+ .quad 0
+
+.align 16
+.L__two_to_jby64_head_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff02c9a30000000
+ .quad 0x3ff059b0d0000000
+ .quad 0x3ff0874510000000
+ .quad 0x3ff0b55860000000
+ .quad 0x3ff0e3ec30000000
+ .quad 0x3ff11301d0000000
+ .quad 0x3ff1429aa0000000
+ .quad 0x3ff172b830000000
+ .quad 0x3ff1a35be0000000
+ .quad 0x3ff1d48730000000
+ .quad 0x3ff2063b80000000
+ .quad 0x3ff2387a60000000
+ .quad 0x3ff26b4560000000
+ .quad 0x3ff29e9df0000000
+ .quad 0x3ff2d285a0000000
+ .quad 0x3ff306fe00000000
+ .quad 0x3ff33c08b0000000
+ .quad 0x3ff371a730000000
+ .quad 0x3ff3a7db30000000
+ .quad 0x3ff3dea640000000
+ .quad 0x3ff4160a20000000
+ .quad 0x3ff44e0860000000
+ .quad 0x3ff486a2b0000000
+ .quad 0x3ff4bfdad0000000
+ .quad 0x3ff4f9b270000000
+ .quad 0x3ff5342b50000000
+ .quad 0x3ff56f4730000000
+ .quad 0x3ff5ab07d0000000
+ .quad 0x3ff5e76f10000000
+ .quad 0x3ff6247eb0000000
+ .quad 0x3ff6623880000000
+ .quad 0x3ff6a09e60000000
+ .quad 0x3ff6dfb230000000
+ .quad 0x3ff71f75e0000000
+ .quad 0x3ff75feb50000000
+ .quad 0x3ff7a11470000000
+ .quad 0x3ff7e2f330000000
+ .quad 0x3ff8258990000000
+ .quad 0x3ff868d990000000
+ .quad 0x3ff8ace540000000
+ .quad 0x3ff8f1ae90000000
+ .quad 0x3ff93737b0000000
+ .quad 0x3ff97d8290000000
+ .quad 0x3ff9c49180000000
+ .quad 0x3ffa0c6670000000
+ .quad 0x3ffa5503b0000000
+ .quad 0x3ffa9e6b50000000
+ .quad 0x3ffae89f90000000
+ .quad 0x3ffb33a2b0000000
+ .quad 0x3ffb7f76f0000000
+ .quad 0x3ffbcc1e90000000
+ .quad 0x3ffc199bd0000000
+ .quad 0x3ffc67f120000000
+ .quad 0x3ffcb720d0000000
+ .quad 0x3ffd072d40000000
+ .quad 0x3ffd5818d0000000
+ .quad 0x3ffda9e600000000
+ .quad 0x3ffdfc9730000000
+ .quad 0x3ffe502ee0000000
+ .quad 0x3ffea4afa0000000
+ .quad 0x3ffefa1be0000000
+ .quad 0x3fff507650000000
+ .quad 0x3fffa7c180000000
+
+.align 16
+.L__two_to_jby64_tail_table:
+ .quad 0x0000000000000000
+ .quad 0x3e6cef00c1dcdef9
+ .quad 0x3e48ac2ba1d73e2a
+ .quad 0x3e60eb37901186be
+ .quad 0x3e69f3121ec53172
+ .quad 0x3e469e8d10103a17
+ .quad 0x3df25b50a4ebbf1a
+ .quad 0x3e6d525bbf668203
+ .quad 0x3e68faa2f5b9bef9
+ .quad 0x3e66df96ea796d31
+ .quad 0x3e368b9aa7805b80
+ .quad 0x3e60c519ac771dd6
+ .quad 0x3e6ceac470cd83f5
+ .quad 0x3e5789f37495e99c
+ .quad 0x3e547f7b84b09745
+ .quad 0x3e5b900c2d002475
+ .quad 0x3e64636e2a5bd1ab
+ .quad 0x3e4320b7fa64e430
+ .quad 0x3e5ceaa72a9c5154
+ .quad 0x3e53967fdba86f24
+ .quad 0x3e682468446b6824
+ .quad 0x3e3f72e29f84325b
+ .quad 0x3e18624b40c4dbd0
+ .quad 0x3e5704f3404f068e
+ .quad 0x3e54d8a89c750e5e
+ .quad 0x3e5a74b29ab4cf62
+ .quad 0x3e5a753e077c2a0f
+ .quad 0x3e5ad49f699bb2c0
+ .quad 0x3e6a90a852b19260
+ .quad 0x3e56b48521ba6f93
+ .quad 0x3e0d2ac258f87d03
+ .quad 0x3e42a91124893ecf
+ .quad 0x3e59fcef32422cbe
+ .quad 0x3e68ca345de441c5
+ .quad 0x3e61d8bee7ba46e1
+ .quad 0x3e59099f22fdba6a
+ .quad 0x3e4f580c36bea881
+ .quad 0x3e5b3d398841740a
+ .quad 0x3e62999c25159f11
+ .quad 0x3e668925d901c83b
+ .quad 0x3e415506dadd3e2a
+ .quad 0x3e622aee6c57304e
+ .quad 0x3e29b8bc9e8a0387
+ .quad 0x3e6fbc9c9f173d24
+ .quad 0x3e451f8480e3e235
+ .quad 0x3e66bbcac96535b5
+ .quad 0x3e41f12ae45a1224
+ .quad 0x3e55e7f6fd0fac90
+ .quad 0x3e62b5a75abd0e69
+ .quad 0x3e609e2bf5ed7fa1
+ .quad 0x3e47daf237553d84
+ .quad 0x3e12f074891ee83d
+ .quad 0x3e6b0aa538444196
+ .quad 0x3e6cafa29694426f
+ .quad 0x3e69df20d22a0797
+ .quad 0x3e640f12f71a1e45
+ .quad 0x3e69f7490e4bb40b
+ .quad 0x3e4ed9942b84600d
+ .quad 0x3e4bdcdaf5cb4656
+ .quad 0x3e5e2cffd89cf44c
+ .quad 0x3e452486cc2c7b9d
+ .quad 0x3e6cc2b44eee3fa4
+ .quad 0x3e66dc8a80ce9f09
+ .quad 0x3e39e90d82e90a7e
+
+
+#endif
diff --git a/src/gas/powf.S b/src/gas/powf.S
new file mode 100644
index 0000000..96eefd2
--- /dev/null
+++ b/src/gas/powf.S
@@ -0,0 +1,1040 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# powf.S
+#
+# An implementation of the powf libm function.
+#
+# Prototype:
+#
+# float powf(float x, float y);
+#
+
+#
+# Algorithm:
+# x^y = e^(y*ln(x))
+#
+# Look in exp, log for the respective algorithms
+#
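+# Only single-precision accuracy is required, so x and y are promoted to
+# double (cvtss2sd below) and the whole computation runs in plain double
+# precision with smaller tables and no head/tail bookkeeping; the structure
+# otherwise mirrors pow.S.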
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(powf)
+#define fname_special _powf_special@PLT
+
+
+# local variable storage offsets
+.equ save_x, 0x0
+.equ save_y, 0x10
+.equ p_temp_exp, 0x20
+.equ negate_result, 0x30
+.equ save_ax, 0x40
+.equ y_head, 0x50
+.equ p_temp_log, 0x60
+.equ stack_size, 0x78
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ sub $stack_size, %rsp
+
+ movss %xmm0, save_x(%rsp)
+ movss %xmm1, save_y(%rsp)
+
+ mov save_x(%rsp), %edx
+ mov save_y(%rsp), %r8d
+
+ mov .L__f32_exp_mant_mask(%rip), %r10d
+ and %r8d, %r10d
+ jz .L__y_is_zero
+
+ cmp .L__f32_pos_one(%rip), %r8d
+ je .L__y_is_one
+
+ mov .L__f32_sign_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_sign_mask(%rip), %r9d
+ mov .L__f32_pos_zero(%rip), %eax
+ mov %eax, negate_result(%rsp)
+ je .L__x_is_neg
+
+ cmp .L__f32_pos_one(%rip), %edx
+ je .L__x_is_pos_one
+
+ cmp .L__f32_pos_zero(%rip), %edx
+ je .L__x_is_zero
+
+ mov .L__f32_exp_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ je .L__x_is_inf_or_nan
+
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_ay_max_bound(%rip), %r10d
+ jg .L__ay_is_very_large
+
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_ay_min_bound(%rip), %r10d
+ jl .L__ay_is_very_small
+
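+        # A rough C-like summary of the dispatch above (exact NaN/SNaN handling
+        # and the flag values passed to fname_special are omitted; see the
+        # labelled blocks further down):
+        #
+        #     if (y == 0.0f)              return 1.0f;      /* .L__y_is_zero        */
+        #     if (y == 1.0f)              return x;         /* .L__y_is_one         */
+        #     if (signbit(x))             goto x_is_neg;    /* y must be an integer */
+        #     if (x == 1.0f)              return 1.0f;      /* .L__x_is_pos_one     */
+        #     if (x == 0.0f)              goto x_is_zero;
+        #     if (isinf(x) || isnan(x))   goto x_is_inf_or_nan;
+        #     if (|y| is huge)            goto ay_is_very_large;  /* 0, 1 or inf    */
+        #     if (|y| is tiny)            return 1.0f;            /* x^y ~= 1       */
+        #     /* otherwise fall through to the log/exp path */
+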
+ # -----------------------------
+ # compute log(x) here
+ # -----------------------------
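+        # A hedged C sketch of the table-driven log(x) computed below
+        # (simplified: the real code also rounds the table index using the
+        # 8th mantissa bit and keeps extra precision):
+        #
+        #     #include <stdint.h>
+        #     #include <string.h>
+        #     extern const double ln_table[129];   /* assumed: ln_table[j] = ln(1 + j/128) */
+        #     static double log_sketch(double x)   /* x positive, normal */
+        #     {
+        #         uint64_t u;  memcpy(&u, &x, 8);
+        #         int    xexp = (int)(u >> 52) - 1023;            /* unbiased exponent   */
+        #         int    j    = (int)((u >> 45) & 0x7f);          /* top 7 mantissa bits */
+        #         double Y    = 0.5 + (double)(u & 0xfffffffffffffULL) * 0x1p-53;
+        #         double F    = 0.5 + (double)j * 0x1p-8;         /* table break point   */
+        #         double r    = (F - Y) * (1.0 / F);              /* so Y = F*(1 - r)    */
+        #         double poly = r + r*r/2 + r*r*r/3 + r*r*r*r/4;  /* ~ -ln(1 - r)        */
+        #         return xexp * 0.69314718055994531 + ln_table[j] - poly;
+        #     }
+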
+.L__log_x:
+
+ movss save_y(%rsp), %xmm7
+ cvtss2sd %xmm0, %xmm0
+ cvtss2sd %xmm7, %xmm7
+ movsd %xmm7, save_y(%rsp)
+
+ # compute exponent part
+ xor %r8, %r8
+ movdqa %xmm0, %xmm3
+ psrlq $52, %xmm3
+ movd %xmm0, %r8
+ psubq .L__mask_1023(%rip), %xmm3
+ movdqa %xmm0, %xmm2
+ cvtdq2pd %xmm3, %xmm6 # xexp
+ pand .L__real_mant(%rip), %xmm2
+
+ # compute index into the log tables
+ mov %r8, %r9
+ and .L__mask_mant_all7(%rip), %r8
+ and .L__mask_mant8(%rip), %r9
+ shl %r9
+ add %r9, %r8
+ mov %r8, p_temp_log(%rsp)
+
+ # F, Y
+ movsd p_temp_log(%rsp), %xmm1
+ shr $45, %r8
+ por .L__real_half(%rip), %xmm2
+ por .L__real_half(%rip), %xmm1
+ lea .L__log_F_inv(%rip), %r9
+
+ # f = F - Y, r = f * inv
+ subsd %xmm2, %xmm1
+ mulsd (%r9,%r8,8), %xmm1
+ movsd %xmm1, %xmm2
+
+ lea .L__log_128_table(%rip), %r9
+ movsd .L__real_log2(%rip), %xmm5
+ movsd (%r9,%r8,8), %xmm0
+
+ # poly
+ mulsd %xmm2, %xmm1
+ movsd .L__real_1_over_4(%rip), %xmm4
+ movsd .L__real_1_over_2(%rip), %xmm3
+ mulsd %xmm2, %xmm4
+ mulsd %xmm2, %xmm3
+ mulsd %xmm2, %xmm1
+ addsd .L__real_1_over_3(%rip), %xmm4
+ addsd .L__real_1_over_1(%rip), %xmm3
+ mulsd %xmm1, %xmm4
+ mulsd %xmm2, %xmm3
+ addsd %xmm4, %xmm3
+
+ mulsd %xmm6, %xmm5
+ subsd %xmm3, %xmm0
+ addsd %xmm5, %xmm0
+
+ movsd save_y(%rsp), %xmm7
+ mulsd %xmm7, %xmm0
+
+ # v = y * ln(x)
+ # xmm0 - v
+
+ # -----------------------------
+ # compute exp( y * ln(x) ) here
+ # -----------------------------
+
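+        # A hedged C sketch of the exp(v) evaluation below, where v = y*ln(x)
+        # (the over/underflow checks against .L__real_p4096 / .L__real_m4768
+        # are performed first in the code that follows):
+        #
+        #     #include <math.h>
+        #     extern const double two_to_jby32[32];     /* assumed: the 2^(j/32) table below */
+        #     static double exp_sketch(double v)
+        #     {
+        #         const double ln2 = 0.69314718055994531;
+        #         int    n = (int)lrint(v * 32.0 / ln2);   /* nearest multiple of ln2/32 */
+        #         int    j = n & 31, m = (n - j) / 32;
+        #         double r = v - n * (ln2 / 32.0);
+        #         double q = r + r*r/2 + r*r*r/6 + r*r*r*r/24;     /* ~ e^r - 1 */
+        #         return ldexp(two_to_jby32[j] * (1.0 + q), m);    /* * 2^m     */
+        #     }
+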
+ # x * (32/ln(2))
+ movsd .L__real_32_by_log2(%rip), %xmm7
+ movsd %xmm0, p_temp_exp(%rsp)
+ mulsd %xmm0, %xmm7
+ mov p_temp_exp(%rsp), %rdx
+
+ # v < 128*ln(2), ( v * (32/ln(2)) ) < 32*128
+ # v >= -150*ln(2), ( v * (32/ln(2)) ) >= 32*(-150)
+ comisd .L__real_p4096(%rip), %xmm7
+ jae .L__process_result_inf
+
+ comisd .L__real_m4768(%rip), %xmm7
+ jb .L__process_result_zero
+
+ # n = int( v * (32/ln(2)) )
+ cvtpd2dq %xmm7, %xmm4
+ lea .L__two_to_jby32_table(%rip), %r10
+ cvtdq2pd %xmm4, %xmm1
+
+ # r = x - n * ln(2)/32
+ movsd .L__real_log2_by_32(%rip), %xmm2
+ mulsd %xmm1, %xmm2
+ movd %xmm4, %ecx
+ mov $0x1f, %rax
+ and %ecx, %eax
+ subsd %xmm2, %xmm0
+ movsd %xmm0, %xmm1
+
+ # m = (n - j) / 32
+ sub %eax, %ecx
+ sar $5, %ecx
+
+ # q
+ mulsd %xmm0, %xmm1
+ movsd .L__real_1_by_24(%rip), %xmm4
+ movsd .L__real_1_by_2(%rip), %xmm3
+ mulsd %xmm0, %xmm4
+ mulsd %xmm0, %xmm3
+ mulsd %xmm0, %xmm1
+ addsd .L__real_1_by_6(%rip), %xmm4
+ addsd .L__real_1_by_1(%rip), %xmm3
+ mulsd %xmm1, %xmm4
+ mulsd %xmm0, %xmm3
+ addsd %xmm4, %xmm3
+ movsd %xmm3, %xmm0
+
+ add $1023, %rcx
+ shl $52, %rcx
+
+ # (f)*(1+q)
+ movsd (%r10,%rax,8), %xmm1
+ mulsd %xmm1, %xmm0
+ addsd %xmm1, %xmm0
+
+ mov %rcx, p_temp_exp(%rsp)
+ mulsd p_temp_exp(%rsp), %xmm0
+ cvtsd2ss %xmm0, %xmm0
+ orps negate_result(%rsp), %xmm0
+
+.L__final_check:
+ add $stack_size, %rsp
+ ret
+
+.p2align 4,,15
+.L__process_result_zero:
+ mov .L__f32_real_zero(%rip), %r11d
+ or negate_result(%rsp), %r11d
+ jmp .L__z_is_zero_or_inf
+
+.p2align 4,,15
+.L__process_result_inf:
+ mov .L__f32_real_inf(%rip), %r11d
+ or negate_result(%rsp), %r11d
+ jmp .L__z_is_zero_or_inf
+
+
+.p2align 4,,15
+.L__x_is_neg:
+
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_ay_max_bound(%rip), %r10d
+ jg .L__ay_is_very_large
+
+ # determine if y is an integer
+ mov .L__f32_exp_mant_mask(%rip), %r10d
+ and %r8d, %r10d
+ mov %r10d, %r11d
+ mov .L__f32_exp_shift(%rip), %ecx
+ shr %cl, %r10d
+ sub .L__f32_exp_bias(%rip), %r10d
+ js .L__x_is_neg_y_is_not_int
+
+ mov .L__f32_exp_mant_mask(%rip), %eax
+ and %edx, %eax
+ mov %eax, save_ax(%rsp)
+
+ cmp .L__yexp_24(%rip), %r10d
+ mov %r10d, %ecx
+ jg .L__continue_after_y_int_check
+
+ mov .L__f32_mant_full(%rip), %r9d
+ shr %cl, %r9d
+ and %r11d, %r9d
+ jnz .L__x_is_neg_y_is_not_int
+
+ mov .L__f32_1_before_mant(%rip), %r9d
+ shr %cl, %r9d
+ and %r11d, %r9d
+ jz .L__continue_after_y_int_check
+
+ mov .L__f32_sign_mask(%rip), %eax
+ mov %eax, negate_result(%rsp)
+
+.L__continue_after_y_int_check:
+
+ cmp .L__f32_neg_zero(%rip), %edx
+ je .L__x_is_zero
+
+ cmp .L__f32_neg_one(%rip), %edx
+ je .L__x_is_neg_one
+
+ mov .L__f32_exp_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ je .L__x_is_inf_or_nan
+
+ movss save_ax(%rsp), %xmm0
+ jmp .L__log_x
+
+.p2align 4,,15
+.L__x_is_pos_one:
+ xor %eax, %eax
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ jz .L__final_check
+
+ mov .L__f32_qnan_set(%rip), %r10d
+ and %r8d, %r10d
+ jnz .L__final_check
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movss .L__f32_pos_one(%rip), %xmm2
+ mov .L__flag_x_one_y_snan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_zero:
+
+ xor %eax, %eax
+ mov .L__f32_exp_mask(%rip), %r9d
+ mov .L__f32_pos_one(%rip), %r11d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ cmove %edx, %eax
+ mov .L__f32_mant_mask(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__x_is_nan
+
+ movss .L__f32_pos_one(%rip), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_one:
+ xor %eax, %eax
+ mov %edx, %r11d
+ mov .L__f32_exp_mask(%rip), %r9d
+ or .L__f32_qnan_set(%rip), %r11d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ cmove %edx, %eax
+ mov .L__f32_mant_mask(%rip), %r9d
+ and %eax, %r9d
+ jnz .L__x_is_nan
+
+ movd %edx, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_neg_one:
+ mov .L__f32_pos_one(%rip), %edx
+ or negate_result(%rsp), %edx
+ xor %eax, %eax
+ mov %r8d, %r11d
+ mov .L__f32_exp_mask(%rip), %r10d
+ or .L__f32_qnan_set(%rip), %r11d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ jnz .L__y_is_nan
+
+ movd %edx, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_neg_y_is_not_int:
+ mov .L__f32_exp_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ je .L__x_is_inf_or_nan
+
+ cmp .L__f32_neg_zero(%rip), %edx
+ je .L__x_is_zero
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movss .L__f32_qnan(%rip), %xmm2
+ mov .L__flag_x_neg_y_notint(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__ay_is_very_large:
+ mov .L__f32_exp_mask(%rip), %r9d
+ and %edx, %r9d
+ cmp .L__f32_exp_mask(%rip), %r9d
+ je .L__x_is_inf_or_nan
+
+ mov .L__f32_exp_mant_mask(%rip), %r9d
+ and %edx, %r9d
+ jz .L__x_is_zero
+
+ cmp .L__f32_neg_one(%rip), %edx
+ je .L__x_is_neg_one
+
+ mov %edx, %r9d
+ and .L__f32_exp_mant_mask(%rip), %r9d
+ cmp .L__f32_pos_one(%rip), %r9d
+ jl .L__ax_lt1_y_is_large_or_inf_or_nan
+
+ jmp .L__ax_gt1_y_is_large_or_inf_or_nan
+
+.p2align 4,,15
+.L__x_is_zero:
+ mov .L__f32_exp_mask(%rip), %r10d
+ xor %eax, %eax
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ je .L__x_is_zero_y_is_inf_or_nan
+
+ mov .L__f32_sign_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmovnz .L__f32_pos_inf(%rip), %eax
+ jnz .L__x_is_zero_z_is_inf
+
+ movd %eax, %xmm0
+ orps negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_zero_z_is_inf:
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %eax, %xmm2
+ orps negate_result(%rsp), %xmm2
+ mov .L__flag_x_zero_z_inf(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_zero_y_is_inf_or_nan:
+ mov %r8d, %r11d
+ cmp .L__f32_neg_inf(%rip), %r8d
+ cmove .L__f32_pos_inf(%rip), %eax
+ je .L__x_is_zero_z_is_inf
+
+ or .L__f32_qnan_set(%rip), %r11d
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %r8d, %r10d
+ jnz .L__y_is_nan
+
+ movd %eax, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_inf_or_nan:
+ xor %r11d, %r11d
+ mov .L__f32_sign_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmovz .L__f32_pos_inf(%rip), %r11d
+ mov %edx, %eax
+ mov .L__f32_mant_mask(%rip), %r9d
+ or .L__f32_qnan_set(%rip), %eax
+ and %edx, %r9d
+ cmovnz %eax, %r11d
+ jnz .L__x_is_nan
+
+ xor %eax, %eax
+ mov %r8d, %r9d
+ mov .L__f32_exp_mask(%rip), %r10d
+ or .L__f32_qnan_set(%rip), %r9d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ cmovnz %r9d, %r11d
+ jnz .L__y_is_nan
+
+ movd %r11d, %xmm0
+ orps negate_result(%rsp), %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__ay_is_very_small:
+ movss .L__f32_pos_one(%rip), %xmm0
+ addss %xmm1, %xmm0
+ jmp .L__final_check
+
+
+.p2align 4,,15
+.L__ax_lt1_y_is_large_or_inf_or_nan:
+ xor %r11d, %r11d
+ mov .L__f32_sign_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmovnz .L__f32_pos_inf(%rip), %r11d
+ jmp .L__adjust_for_nan
+
+.p2align 4,,15
+.L__ax_gt1_y_is_large_or_inf_or_nan:
+ xor %r11d, %r11d
+ mov .L__f32_sign_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmovz .L__f32_pos_inf(%rip), %r11d
+
+.p2align 4,,15
+.L__adjust_for_nan:
+
+ xor %eax, %eax
+ mov %r8d, %r9d
+ mov .L__f32_exp_mask(%rip), %r10d
+ or .L__f32_qnan_set(%rip), %r9d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ cmovnz %r9d, %r11d
+ jnz .L__y_is_nan
+
+ test %eax, %eax
+ jnz .L__y_is_inf
+
+.p2align 4,,15
+.L__z_is_zero_or_inf:
+
+ mov .L__flag_z_zero(%rip), %edi
+ test %r11d, %r11d
+ cmovnz .L__flag_z_inf(%rip), %edi
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %r11d, %xmm2
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_inf:
+
+ movd %r11d, %xmm0
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_nan:
+
+ xor %eax, %eax
+ mov .L__f32_exp_mask(%rip), %r10d
+ and %r8d, %r10d
+ cmp .L__f32_exp_mask(%rip), %r10d
+ cmove %r8d, %eax
+ mov .L__f32_mant_mask(%rip), %r10d
+ and %eax, %r10d
+ jnz .L__x_is_nan_y_is_nan
+
+ mov .L__f32_qnan_set(%rip), %r9d
+ and %edx, %r9d
+ movd %r11d, %xmm0
+ jnz .L__final_check
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %r11d, %xmm2
+ mov .L__flag_x_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__y_is_nan:
+
+ mov .L__f32_qnan_set(%rip), %r10d
+ and %r8d, %r10d
+ movd %r11d, %xmm0
+ jnz .L__final_check
+
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %r11d, %xmm2
+ mov .L__flag_y_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.p2align 4,,15
+.L__x_is_nan_y_is_nan:
+
+ mov .L__f32_qnan_set(%rip), %r9d
+ and %edx, %r9d
+ jz .L__continue_xy_nan
+
+ mov .L__f32_qnan_set(%rip), %r10d
+ and %r8d, %r10d
+ jz .L__continue_xy_nan
+
+ movd %r11d, %xmm0
+ jmp .L__final_check
+
+.L__continue_xy_nan:
+ movss save_x(%rsp), %xmm0
+ movss save_y(%rsp), %xmm1
+ movd %r11d, %xmm2
+ mov .L__flag_x_nan_y_nan(%rip), %edi
+
+ call fname_special
+ jmp .L__final_check
+
+.data
+
+.align 16
+
+# these codes and the ones in the corresponding .c file have to match
+.L__flag_x_one_y_snan: .long 1
+.L__flag_x_zero_z_inf: .long 2
+.L__flag_x_nan: .long 3
+.L__flag_y_nan: .long 4
+.L__flag_x_nan_y_nan: .long 5
+.L__flag_x_neg_y_notint: .long 6
+.L__flag_z_zero: .long 7
+.L__flag_z_denormal: .long 8
+.L__flag_z_inf: .long 9
+
+.align 16
+
+.L__f32_ay_max_bound: .long 0x4f000000
+.L__f32_ay_min_bound: .long 0x2e800000
+.L__f32_sign_mask: .long 0x80000000
+.L__f32_sign_and_exp_mask: .long 0x0ff800000
+.L__f32_exp_mask: .long 0x7f800000
+.L__f32_neg_inf: .long 0x0ff800000
+.L__f32_pos_inf: .long 0x7f800000
+.L__f32_pos_one: .long 0x3f800000
+.L__f32_pos_zero: .long 0x00000000
+.L__f32_exp_mant_mask: .long 0x7fffffff
+.L__f32_mant_mask: .long 0x007fffff
+
+.L__f32_neg_qnan: .long 0x0ffc00000
+.L__f32_qnan: .long 0x7fc00000
+.L__f32_qnan_set: .long 0x00400000
+
+.L__f32_neg_one: .long 0x0bf800000
+.L__f32_neg_zero: .long 0x80000000
+
+.L__f32_real_one: .long 0x3f800000
+.L__f32_real_zero: .long 0x00000000
+.L__f32_real_inf: .long 0x7f800000
+
+.L__yexp_24: .long 0x00000018
+
+.L__f32_exp_shift: .long 0x00000017
+.L__f32_exp_bias: .long 0x0000007f
+.L__f32_mant_full: .long 0x007fffff
+.L__f32_1_before_mant: .long 0x00800000
+
+.align 16
+
+.L__mask_mant_all7: .quad 0x000fe00000000000
+.L__mask_mant8: .quad 0x0000100000000000
+
+#---------------------
+# log data
+#---------------------
+
+.align 16
+
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0000000000000000
+.L__real_inf: .quad 0x7ff0000000000000 # +inf
+ .quad 0x0000000000000000
+.L__real_nan: .quad 0x7ff8000000000000 # NaN
+ .quad 0x0000000000000000
+.L__real_mant: .quad 0x000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000000000000000
+.L__mask_1023: .quad 0x00000000000003ff
+ .quad 0x0000000000000000
+
+
+.L__real_log2: .quad 0x3fe62e42fefa39ef
+ .quad 0x0000000000000000
+
+.L__real_two: .quad 0x4000000000000000 # 2
+ .quad 0x0000000000000000
+
+.L__real_one: .quad 0x3ff0000000000000 # 1
+ .quad 0x0000000000000000
+
+.L__real_half: .quad 0x3fe0000000000000 # 1/2
+ .quad 0x0000000000000000
+
+.L__real_1_over_1: .quad 0x3ff0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_2: .quad 0x3fe0000000000000
+ .quad 0x0000000000000000
+.L__real_1_over_3: .quad 0x3fd5555555555555
+ .quad 0x0000000000000000
+.L__real_1_over_4: .quad 0x3fd0000000000000
+ .quad 0x0000000000000000
+
+
+.align 16
+.L__log_128_table:
+ .quad 0x0000000000000000
+ .quad 0x3f7fe02a6b106789
+ .quad 0x3f8fc0a8b0fc03e4
+ .quad 0x3f97b91b07d5b11b
+ .quad 0x3f9f829b0e783300
+ .quad 0x3fa39e87b9febd60
+ .quad 0x3fa77458f632dcfc
+ .quad 0x3fab42dd711971bf
+ .quad 0x3faf0a30c01162a6
+ .quad 0x3fb16536eea37ae1
+ .quad 0x3fb341d7961bd1d1
+ .quad 0x3fb51b073f06183f
+ .quad 0x3fb6f0d28ae56b4c
+ .quad 0x3fb8c345d6319b21
+ .quad 0x3fba926d3a4ad563
+ .quad 0x3fbc5e548f5bc743
+ .quad 0x3fbe27076e2af2e6
+ .quad 0x3fbfec9131dbeabb
+ .quad 0x3fc0d77e7cd08e59
+ .quad 0x3fc1b72ad52f67a0
+ .quad 0x3fc29552f81ff523
+ .quad 0x3fc371fc201e8f74
+ .quad 0x3fc44d2b6ccb7d1e
+ .quad 0x3fc526e5e3a1b438
+ .quad 0x3fc5ff3070a793d4
+ .quad 0x3fc6d60fe719d21d
+ .quad 0x3fc7ab890210d909
+ .quad 0x3fc87fa06520c911
+ .quad 0x3fc9525a9cf456b4
+ .quad 0x3fca23bc1fe2b563
+ .quad 0x3fcaf3c94e80bff3
+ .quad 0x3fcbc286742d8cd6
+ .quad 0x3fcc8ff7c79a9a22
+ .quad 0x3fcd5c216b4fbb91
+ .quad 0x3fce27076e2af2e6
+ .quad 0x3fcef0adcbdc5936
+ .quad 0x3fcfb9186d5e3e2b
+ .quad 0x3fd0402594b4d041
+ .quad 0x3fd0a324e27390e3
+ .quad 0x3fd1058bf9ae4ad5
+ .quad 0x3fd1675cababa60e
+ .quad 0x3fd1c898c16999fb
+ .quad 0x3fd22941fbcf7966
+ .quad 0x3fd2895a13de86a3
+ .quad 0x3fd2e8e2bae11d31
+ .quad 0x3fd347dd9a987d55
+ .quad 0x3fd3a64c556945ea
+ .quad 0x3fd404308686a7e4
+ .quad 0x3fd4618bc21c5ec2
+ .quad 0x3fd4be5f957778a1
+ .quad 0x3fd51aad872df82d
+ .quad 0x3fd5767717455a6c
+ .quad 0x3fd5d1bdbf5809ca
+ .quad 0x3fd62c82f2b9c795
+ .quad 0x3fd686c81e9b14af
+ .quad 0x3fd6e08eaa2ba1e4
+ .quad 0x3fd739d7f6bbd007
+ .quad 0x3fd792a55fdd47a2
+ .quad 0x3fd7eaf83b82afc3
+ .quad 0x3fd842d1da1e8b17
+ .quad 0x3fd89a3386c1425b
+ .quad 0x3fd8f11e873662c8
+ .quad 0x3fd947941c2116fb
+ .quad 0x3fd99d958117e08b
+ .quad 0x3fd9f323ecbf984c
+ .quad 0x3fda484090e5bb0a
+ .quad 0x3fda9cec9a9a084a
+ .quad 0x3fdaf1293247786b
+ .quad 0x3fdb44f77bcc8f63
+ .quad 0x3fdb9858969310fb
+ .quad 0x3fdbeb4d9da71b7c
+ .quad 0x3fdc3dd7a7cdad4d
+ .quad 0x3fdc8ff7c79a9a22
+ .quad 0x3fdce1af0b85f3eb
+ .quad 0x3fdd32fe7e00ebd5
+ .quad 0x3fdd83e7258a2f3e
+ .quad 0x3fddd46a04c1c4a1
+ .quad 0x3fde24881a7c6c26
+ .quad 0x3fde744261d68788
+ .quad 0x3fdec399d2468cc0
+ .quad 0x3fdf128f5faf06ed
+ .quad 0x3fdf6123fa7028ac
+ .quad 0x3fdfaf588f78f31f
+ .quad 0x3fdffd2e0857f498
+ .quad 0x3fe02552a5a5d0ff
+ .quad 0x3fe04bdf9da926d2
+ .quad 0x3fe0723e5c1cdf40
+ .quad 0x3fe0986f4f573521
+ .quad 0x3fe0be72e4252a83
+ .quad 0x3fe0e44985d1cc8c
+ .quad 0x3fe109f39e2d4c97
+ .quad 0x3fe12f719593efbc
+ .quad 0x3fe154c3d2f4d5ea
+ .quad 0x3fe179eabbd899a1
+ .quad 0x3fe19ee6b467c96f
+ .quad 0x3fe1c3b81f713c25
+ .quad 0x3fe1e85f5e7040d0
+ .quad 0x3fe20cdcd192ab6e
+ .quad 0x3fe23130d7bebf43
+ .quad 0x3fe2555bce98f7cb
+ .quad 0x3fe2795e1289b11b
+ .quad 0x3fe29d37fec2b08b
+ .quad 0x3fe2c0e9ed448e8c
+ .quad 0x3fe2e47436e40268
+ .quad 0x3fe307d7334f10be
+ .quad 0x3fe32b1339121d71
+ .quad 0x3fe34e289d9ce1d3
+ .quad 0x3fe37117b54747b6
+ .quad 0x3fe393e0d3562a1a
+ .quad 0x3fe3b68449fffc23
+ .quad 0x3fe3d9026a7156fb
+ .quad 0x3fe3fb5b84d16f42
+ .quad 0x3fe41d8fe84672ae
+ .quad 0x3fe43f9fe2f9ce67
+ .quad 0x3fe4618bc21c5ec2
+ .quad 0x3fe48353d1ea88df
+ .quad 0x3fe4a4f85db03ebb
+ .quad 0x3fe4c679afccee3a
+ .quad 0x3fe4e7d811b75bb1
+ .quad 0x3fe50913cc01686b
+ .quad 0x3fe52a2d265bc5ab
+ .quad 0x3fe54b2467999498
+ .quad 0x3fe56bf9d5b3f399
+ .quad 0x3fe58cadb5cd7989
+ .quad 0x3fe5ad404c359f2d
+ .quad 0x3fe5cdb1dc6c1765
+ .quad 0x3fe5ee02a9241675
+ .quad 0x3fe60e32f44788d9
+ .quad 0x3fe62e42fefa39ef
+
+.align 16
+.L__log_F_inv:
+ .quad 0x4000000000000000
+ .quad 0x3fffc07f01fc07f0
+ .quad 0x3fff81f81f81f820
+ .quad 0x3fff44659e4a4271
+ .quad 0x3fff07c1f07c1f08
+ .quad 0x3ffecc07b301ecc0
+ .quad 0x3ffe9131abf0b767
+ .quad 0x3ffe573ac901e574
+ .quad 0x3ffe1e1e1e1e1e1e
+ .quad 0x3ffde5d6e3f8868a
+ .quad 0x3ffdae6076b981db
+ .quad 0x3ffd77b654b82c34
+ .quad 0x3ffd41d41d41d41d
+ .quad 0x3ffd0cb58f6ec074
+ .quad 0x3ffcd85689039b0b
+ .quad 0x3ffca4b3055ee191
+ .quad 0x3ffc71c71c71c71c
+ .quad 0x3ffc3f8f01c3f8f0
+ .quad 0x3ffc0e070381c0e0
+ .quad 0x3ffbdd2b899406f7
+ .quad 0x3ffbacf914c1bad0
+ .quad 0x3ffb7d6c3dda338b
+ .quad 0x3ffb4e81b4e81b4f
+ .quad 0x3ffb2036406c80d9
+ .quad 0x3ffaf286bca1af28
+ .quad 0x3ffac5701ac5701b
+ .quad 0x3ffa98ef606a63be
+ .quad 0x3ffa6d01a6d01a6d
+ .quad 0x3ffa41a41a41a41a
+ .quad 0x3ffa16d3f97a4b02
+ .quad 0x3ff9ec8e951033d9
+ .quad 0x3ff9c2d14ee4a102
+ .quad 0x3ff999999999999a
+ .quad 0x3ff970e4f80cb872
+ .quad 0x3ff948b0fcd6e9e0
+ .quad 0x3ff920fb49d0e229
+ .quad 0x3ff8f9c18f9c18fa
+ .quad 0x3ff8d3018d3018d3
+ .quad 0x3ff8acb90f6bf3aa
+ .quad 0x3ff886e5f0abb04a
+ .quad 0x3ff8618618618618
+ .quad 0x3ff83c977ab2bedd
+ .quad 0x3ff8181818181818
+ .quad 0x3ff7f405fd017f40
+ .quad 0x3ff7d05f417d05f4
+ .quad 0x3ff7ad2208e0ecc3
+ .quad 0x3ff78a4c8178a4c8
+ .quad 0x3ff767dce434a9b1
+ .quad 0x3ff745d1745d1746
+ .quad 0x3ff724287f46debc
+ .quad 0x3ff702e05c0b8170
+ .quad 0x3ff6e1f76b4337c7
+ .quad 0x3ff6c16c16c16c17
+ .quad 0x3ff6a13cd1537290
+ .quad 0x3ff6816816816817
+ .quad 0x3ff661ec6a5122f9
+ .quad 0x3ff642c8590b2164
+ .quad 0x3ff623fa77016240
+ .quad 0x3ff6058160581606
+ .quad 0x3ff5e75bb8d015e7
+ .quad 0x3ff5c9882b931057
+ .quad 0x3ff5ac056b015ac0
+ .quad 0x3ff58ed2308158ed
+ .quad 0x3ff571ed3c506b3a
+ .quad 0x3ff5555555555555
+ .quad 0x3ff5390948f40feb
+ .quad 0x3ff51d07eae2f815
+ .quad 0x3ff5015015015015
+ .quad 0x3ff4e5e0a72f0539
+ .quad 0x3ff4cab88725af6e
+ .quad 0x3ff4afd6a052bf5b
+ .quad 0x3ff49539e3b2d067
+ .quad 0x3ff47ae147ae147b
+ .quad 0x3ff460cbc7f5cf9a
+ .quad 0x3ff446f86562d9fb
+ .quad 0x3ff42d6625d51f87
+ .quad 0x3ff4141414141414
+ .quad 0x3ff3fb013fb013fb
+ .quad 0x3ff3e22cbce4a902
+ .quad 0x3ff3c995a47babe7
+ .quad 0x3ff3b13b13b13b14
+ .quad 0x3ff3991c2c187f63
+ .quad 0x3ff3813813813814
+ .quad 0x3ff3698df3de0748
+ .quad 0x3ff3521cfb2b78c1
+ .quad 0x3ff33ae45b57bcb2
+ .quad 0x3ff323e34a2b10bf
+ .quad 0x3ff30d190130d190
+ .quad 0x3ff2f684bda12f68
+ .quad 0x3ff2e025c04b8097
+ .quad 0x3ff2c9fb4d812ca0
+ .quad 0x3ff2b404ad012b40
+ .quad 0x3ff29e4129e4129e
+ .quad 0x3ff288b01288b013
+ .quad 0x3ff27350b8812735
+ .quad 0x3ff25e22708092f1
+ .quad 0x3ff2492492492492
+ .quad 0x3ff23456789abcdf
+ .quad 0x3ff21fb78121fb78
+ .quad 0x3ff20b470c67c0d9
+ .quad 0x3ff1f7047dc11f70
+ .quad 0x3ff1e2ef3b3fb874
+ .quad 0x3ff1cf06ada2811d
+ .quad 0x3ff1bb4a4046ed29
+ .quad 0x3ff1a7b9611a7b96
+ .quad 0x3ff19453808ca29c
+ .quad 0x3ff1811811811812
+ .quad 0x3ff16e0689427379
+ .quad 0x3ff15b1e5f75270d
+ .quad 0x3ff1485f0e0acd3b
+ .quad 0x3ff135c81135c811
+ .quad 0x3ff12358e75d3033
+ .quad 0x3ff1111111111111
+ .quad 0x3ff0fef010fef011
+ .quad 0x3ff0ecf56be69c90
+ .quad 0x3ff0db20a88f4696
+ .quad 0x3ff0c9714fbcda3b
+ .quad 0x3ff0b7e6ec259dc8
+ .quad 0x3ff0a6810a6810a7
+ .quad 0x3ff0953f39010954
+ .quad 0x3ff0842108421084
+ .quad 0x3ff073260a47f7c6
+ .quad 0x3ff0624dd2f1a9fc
+ .quad 0x3ff05197f7d73404
+ .quad 0x3ff0410410410410
+ .quad 0x3ff03091b51f5e1a
+ .quad 0x3ff0204081020408
+ .quad 0x3ff0101010101010
+ .quad 0x3ff0000000000000
+
+#---------------------
+# exp data
+#---------------------
+
+.align 16
+
+.L__real_zero: .quad 0x0000000000000000
+ .quad 0
+
+.L__real_p4096: .quad 0x40b0000000000000
+ .quad 0
+.L__real_m4768: .quad 0x0c0b2a00000000000
+ .quad 0
+
+.L__real_32_by_log2: .quad 0x40471547652b82fe # 32/ln(2)
+ .quad 0
+.L__real_log2_by_32: .quad 0x3f962e42fefa39ef # log2_by_32
+ .quad 0
+
+.L__real_1_by_24: .quad 0x3fa5555555555555 # 1/24
+ .quad 0
+.L__real_1_by_6: .quad 0x3fc5555555555555 # 1/6
+ .quad 0
+.L__real_1_by_2: .quad 0x3fe0000000000000 # 1/2
+ .quad 0
+.L__real_1_by_1: .quad 0x3ff0000000000000 # 1
+ .quad 0
+
+.align 16
+
+.L__two_to_jby32_table:
+ .quad 0x3ff0000000000000
+ .quad 0x3ff059b0d3158574
+ .quad 0x3ff0b5586cf9890f
+ .quad 0x3ff11301d0125b51
+ .quad 0x3ff172b83c7d517b
+ .quad 0x3ff1d4873168b9aa
+ .quad 0x3ff2387a6e756238
+ .quad 0x3ff29e9df51fdee1
+ .quad 0x3ff306fe0a31b715
+ .quad 0x3ff371a7373aa9cb
+ .quad 0x3ff3dea64c123422
+ .quad 0x3ff44e086061892d
+ .quad 0x3ff4bfdad5362a27
+ .quad 0x3ff5342b569d4f82
+ .quad 0x3ff5ab07dd485429
+ .quad 0x3ff6247eb03a5585
+ .quad 0x3ff6a09e667f3bcd
+ .quad 0x3ff71f75e8ec5f74
+ .quad 0x3ff7a11473eb0187
+ .quad 0x3ff82589994cce13
+ .quad 0x3ff8ace5422aa0db
+ .quad 0x3ff93737b0cdc5e5
+ .quad 0x3ff9c49182a3f090
+ .quad 0x3ffa5503b23e255d
+ .quad 0x3ffae89f995ad3ad
+ .quad 0x3ffb7f76f2fb5e47
+ .quad 0x3ffc199bdd85529c
+ .quad 0x3ffcb720dcef9069
+ .quad 0x3ffd5818dcfba487
+ .quad 0x3ffdfc97337b9b5f
+ .quad 0x3ffea4afa2a490da
+ .quad 0x3fff50765b6e4540
+
+
diff --git a/src/gas/remainder.S b/src/gas/remainder.S
new file mode 100644
index 0000000..173da80
--- /dev/null
+++ b/src/gas/remainder.S
@@ -0,0 +1,256 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# remainder.S
+#
+# An implementation of the remainder libm function.
+#
+# Prototype:
+#
+# double remainder(double x,double y);
+#
+
+#
+# Algorithm:
+#
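+#   Hedged outline of the approach the code below appears to take: when the
+#   exponents of |x| and |y| are both non-zero and differ by less than 52,
+#   compute the remainder directly with SSE arithmetic; otherwise fall back
+#   to the x87 fprem1 loop. A rough C sketch of the direct path, ignoring
+#   special cases:
+#
+#     #include <math.h>
+#     static double remainder_sketch(double x, double y)   /* finite, y != 0 */
+#     {
+#         double dx = fabs(x), w = fabs(y);
+#         double n  = trunc(dx / w);                /* integer quotient, toward zero    */
+#         int  todd = fmod(n, 2.0) != 0.0;          /* quotient parity                  */
+#         dx -= n * w;                              /* the real code forms this product */
+#                                                   /* in extended (head+tail) precision */
+#         if (dx + dx > w || (todd && dx + dx == w))
+#             dx -= w;                              /* round the quotient to nearest/even */
+#         return x < 0.0 ? -dx : dx;
+#     }
+#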
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(remainder)
+#define fname_special _remainder_special
+
+
+# local variable storage offsets
+.equ temp_x, 0x0
+.equ temp_y, 0x10
+.equ stack_size, 0x80
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ movd %xmm0,%r8
+ movd %xmm1,%r9
+ movsd %xmm0,%xmm2
+ movsd %xmm1,%xmm3
+ movsd %xmm0,%xmm4
+ movsd %xmm1,%xmm5
+ mov .L__exp_mask_64(%rip), %r10
+ and %r10,%r8
+ and %r10,%r9
+ xor %r10,%r10
+ ror $52, %r8
+ ror $52, %r9
+ cmp $0,%r8
+ jz .L__LargeExpDiffComputation
+ cmp $0,%r9
+ jz .L__LargeExpDiffComputation
+ sub %r9,%r8 #
+ cmp $52,%r8
+ jge .L__LargeExpDiffComputation
+ pand .L__Nan_64(%rip),%xmm4
+ pand .L__Nan_64(%rip),%xmm5
+ comisd %xmm5,%xmm4
+ jp .L__InputIsNaN # if either of xmm1 or xmm0 is a NaN then
+ # parity flag is set
+ jz .L__Input_Is_Equal
+ jbe .L__ReturnImmediate
+ cmp $0x7FF,%r8
+ jz .L__Dividend_Is_Infinity
+
+ #calculation without using the x87 FPU
+.L__DirectComputation:
+ movapd %xmm4,%xmm2
+ movapd %xmm5,%xmm3
+ divsd %xmm3,%xmm2
+ cvttsd2siq %xmm2,%r8
+ mov %r8,%r10
+ and $0X01,%r10
+ cvtsi2sdq %r8,%xmm2
+
+ #multiplication in QUAD Precision
+        #Since a plain double-precision multiplication here lost too much accuracy,
+        #we implement the product in quad-like (double-double) precision.
+        #Logic behind the quad-precision multiplication:
+ #x = hx + tx by setting x's last 27 bits to null
+ #y = hy + ty similar to x
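+        # A hedged C sketch of this split multiplication (a Dekker-style
+        # two_prod; the 27-bit mask is .L__27bit_andingmask_64 below):
+        #
+        #     #include <stdint.h>
+        #     #include <string.h>
+        #     static void two_prod(double x, double y, double *z, double *zz)
+        #     {
+        #         const uint64_t m = 0xfffffffff8000000ULL;   /* clear low 27 mantissa bits */
+        #         uint64_t ux, uy;  double hx, tx, hy, ty;
+        #         memcpy(&ux, &x, 8); ux &= m; memcpy(&hx, &ux, 8); tx = x - hx;
+        #         memcpy(&uy, &y, 8); uy &= m; memcpy(&hy, &uy, 8); ty = y - hy;
+        #         *z  = x * y;                                        /* rounded product */
+        #         *zz = (((hx*hy - *z) + hx*ty) + tx*hy) + tx*ty;     /* rounding error  */
+        #     }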
+ movapd .L__27bit_andingmask_64(%rip),%xmm4
+ movapd %xmm5,%xmm1 # x
+ movapd %xmm2,%xmm6 # y
+ movapd %xmm2,%xmm7 # z = xmm7
+ mulpd %xmm5,%xmm7 # z = x*y
+ andpd %xmm4,%xmm1
+ andpd %xmm4,%xmm2
+ subsd %xmm1,%xmm5 # xmm1 = hx xmm5 = tx
+ subsd %xmm2,%xmm6 # xmm2 = hy xmm6 = ty
+
+ movapd %xmm1,%xmm4 # copy hx
+ mulsd %xmm2,%xmm4 # xmm4 = hx*hy
+ subsd %xmm7,%xmm4 # xmm4 = (hx*hy - z)
+ mulsd %xmm6,%xmm1 # xmm1 = hx * ty
+ addsd %xmm1,%xmm4 # xmm4 = ((hx * hy - *z) + hx * ty)
+ mulsd %xmm5,%xmm2 # xmm2 = tx * hy
+ addsd %xmm2,%xmm4 # xmm4 = (((hx * hy - *z) + hx * ty) + tx * hy)
+ mulsd %xmm5,%xmm6 # xmm6 = tx * ty
+ addsd %xmm4,%xmm6 # xmm6 = (((hx * hy - *z) + hx * ty) + tx * hy) + tx * ty;
+ #xmm6 and xmm7 contain the quad precision result
+ #v = dx - c;
+ movapd %xmm0,%xmm1 # copy the input number
+ pand .L__Nan_64(%rip),%xmm1
+ movapd %xmm1,%xmm2 # xmm2 = dx = xmm1
+ subsd %xmm7,%xmm1 # v = dx - c
+ subsd %xmm1,%xmm2 # (dx - v)
+ subsd %xmm7,%xmm2 # ((dx - v) - c)
+ subsd %xmm6,%xmm2 # (((dx - v) - c) - cc)
+ addsd %xmm1,%xmm2 # xmm2 = dx = v + (((dx - v) - c) - cc)
+ # xmm3 = w
+ movapd %xmm2,%xmm4
+ movapd %xmm3,%xmm5
+ addsd %xmm4,%xmm4 # xmm4 = dx + dx
+ comisd %xmm4,%xmm3 # if (dx + dx > w)
+ jb .L__Substractw
+ mulpd .L__ZeroPointFive(%rip),%xmm5 # xmm5 = 0.5 * w
+ comisd %xmm2,%xmm5 # if (dx > 0.5 * w)
+ jb .L__Substractw
+ cmp $0x01,%r10 # If the quotient is an odd number
+ jnz .L__Finish
+ comisd %xmm4,%xmm3 #if (todd && (dx + dx == w)) then subtract w
+ jz .L__Substractw
+ comisd %xmm0,%xmm5 #if (todd && (dx == 0.5 * w)) then subtract w
+ jnz .L__Finish
+
+.L__Substractw:
+ subsd %xmm3,%xmm2 # dx -= w
+
+# The following code checks the sign of the input number and then calculates the return value
+# return x < 0.0? -dx : dx;
+.L__Finish:
+ comisd .L__Zero_64(%rip), %xmm0
+ ja .L__Not_Negative_Number1
+
+.L__Negative_Number1:
+ movapd .L__Zero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ ret
+.L__Not_Negative_Number1:
+ movapd %xmm2,%xmm0
+ ret
+
+
+        #calculation using the x87 FPU
+        #Used when the exponent of either the divisor or the
+        #dividend is 0, or when the exponent difference is
+        #greater than 52
+.align 16
+.L__LargeExpDiffComputation:
+ sub $stack_size, %rsp
+ movsd %xmm0, temp_x(%rsp)
+ movsd %xmm1, temp_y(%rsp)
+ ffree %st(0)
+ ffree %st(1)
+ fldl temp_y(%rsp)
+ fldl temp_x(%rsp)
+ fnclex
+.align 16
+.L__repeat:
+ fprem1 #Calculate remainder by dividing st(0) with st(1)
+ #fprem operation sets x87 condition codes,
+ #it will set the C2 code to 1 if a partial remainder is calculated
+        fnstsw  %ax              # store the FPU status word into %ax
+        and     $0x0400,%ax      # keep only the C2 condition-code bit
+                                 # we need to check only the C2 bit of the condition codes
+        cmp     $0x0400,%ax      # check whether bit 10 (C2) is set
+                                 # if it is set, only a partial remainder was calculated
+ jz .L__repeat
+ #store the result from the FPU stack to memory
+ fstpl temp_x(%rsp)
+ fstpl temp_y(%rsp)
+ movsd temp_x(%rsp), %xmm0
+ add $stack_size, %rsp
+ ret
+
+ #IF both the inputs are equal
+.L__Input_Is_Equal:
+ cmp $0x7FF,%r8
+ jz .L__Dividend_Is_Infinity
+ cmp $0x7FF,%r9
+ jz .L__InputIsNaN
+ movsd %xmm0,%xmm1
+ pand .L__sign_mask_64(%rip),%xmm1
+ movsd .L__Zero_64(%rip),%xmm0
+ por %xmm1,%xmm0
+ ret
+
+.L__InputIsNaN:
+ por .L__QNaN_mask_64(%rip),%xmm0
+ por .L__exp_mask_64(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+ ret
+
+#Case when x < y
+.L__ReturnImmediate:
+ movapd %xmm5,%xmm7
+ mulpd .L__ZeroPointFive(%rip),%xmm5 #
+ comisd %xmm4,%xmm5
+ jae .L__FoundResult1
+ subsd %xmm7,%xmm4
+ comisd .L__Zero_64(%rip),%xmm0
+ ja .L__Not_Negative_Number
+.L__Negative_Number:
+ movapd .L__Zero_64(%rip),%xmm0
+ subsd %xmm4,%xmm0
+ ret
+
+.L__Not_Negative_Number:
+ movapd %xmm4,%xmm0
+ ret
+.align 16
+.L__FoundResult1:
+ ret
+
+
+
+.align 32
+.L__sign_mask_64: .quad 0x8000000000000000
+ .quad 0x0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__Zero_64: .quad 0x0
+ .quad 0
+.L__QNaN_mask_64: .quad 0x0008000000000000
+ .quad 0
+.L__Nan_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__ZeroPointFive: .quad 0X3FE0000000000000
+ .quad 0
+
diff --git a/src/gas/remainderf.S b/src/gas/remainderf.S
new file mode 100644
index 0000000..d196d11
--- /dev/null
+++ b/src/gas/remainderf.S
@@ -0,0 +1,221 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# remainderf.S
+#
+# An implementation of the remainderf libm function.
+#
+# Prototype:
+#
+# float remainderf(float x,float y);
+#
+
+#
+# Algorithm:
+#
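+#   Hedged C sketch of the approach the code below appears to take: work in
+#   double, strip the signs, and peel the quotient off 24 bits at a time with
+#   a scaled divisor (NaN, Inf, zero-divisor and equal-input cases are
+#   handled separately):
+#
+#     #include <math.h>
+#     static float remainderf_sketch(float x, float y)
+#     {
+#         double dx = fabs((double)x), dy = fabs((double)y);
+#         int diff   = ilogb(dx) - ilogb(dy);
+#         int ntimes = diff > 0 ? diff / 24 : 0;
+#         double w = scalbn(dy, 24 * ntimes);           /* dy * 2^(24*ntimes)    */
+#         while (ntimes-- > 0) {                        /* peel 24 quotient bits */
+#             dx -= w * trunc(dx / w);
+#             w  *= 0x1p-24;
+#         }
+#         double t = trunc(dx / w);
+#         int todd = fmod(t, 2.0) != 0.0;
+#         dx -= w * t;
+#         if (dx > 0.5 * w || (todd && dx == 0.5 * w))
+#             dx -= w;                                  /* round quotient to nearest/even */
+#         return (float)(x < 0.0f ? -dx : dx);
+#     }
+#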
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(remainderf)
+#define fname_special _remainderf_special
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+ mov .L__exp_mask_64(%rip), %rdi
+ movapd .L__sign_mask_64(%rip),%xmm6
+ cvtss2sd %xmm0,%xmm2 # double x
+ cvtss2sd %xmm1,%xmm3 # double y
+ pand %xmm6,%xmm2
+ pand %xmm6,%xmm3
+ movd %xmm2,%rax
+ movd %xmm3,%r8
+ mov %rax,%r11
+ mov %r8,%r9
+ movsd %xmm2,%xmm4
+ #take the exponents of both x and y
+ and %rdi,%rax
+ and %rdi,%r8
+ ror $52, %rax
+ ror $52, %r8
+        #if either of the exponents is all ones (infinity or NaN)
+ cmp $0X7FF,%rax
+ jz .L__InputIsNaN
+ cmp $0X7FF,%r8
+ jz .L__InputIsNaNOrInf
+
+ cmp $0,%r8
+ jz .L__Divisor_Is_Zero
+
+ cmp %r9, %r11
+ jz .L__Input_Is_Equal
+ jb .L__ReturnImmediate
+
+ xor %rcx,%rcx
+ mov $24,%rdx
+ movsd .L__One_64(%rip),%xmm7 # xmm7 = scale
+ cmp %rax,%r8
+ jae .L__y_is_greater
+ #xmm3 = dy
+ sub %r8,%rax
+ div %dl # al = ntimes
+ mov %al,%cl # cl = ntimes
+        and     $0xFF,%ax    # set everything to zero except al
+ mul %dl # ax = dl * al = 24* ntimes
+ add $1023, %rax
+ shl $52,%rax
+ movd %rax,%xmm7 # xmm7 = scale
+.L__y_is_greater:
+ mulsd %xmm3,%xmm7 # xmm7 = scale * dy
+ movsd .L__2pminus24_decimal(%rip),%xmm6
+
+.align 16
+.L__Start_Loop:
+ dec %cl
+ js .L__End_Loop
+ divsd %xmm7,%xmm4 # xmm7 = (dx / w)
+ cvttsd2siq %xmm4,%rax
+ cvtsi2sdq %rax,%xmm4 # xmm4 = t = (double)((int)(dx / w))
+ mulsd %xmm7,%xmm4 # xmm4 = w*t
+ mulsd %xmm6,%xmm7 # w*= scale
+ subsd %xmm4,%xmm2 # xmm2 = dx -= w*t
+ movsd %xmm2,%xmm4 # xmm4 = dx
+ jmp .L__Start_Loop
+.L__End_Loop:
+ divsd %xmm7,%xmm4 # xmm7 = (dx / w)
+ cvttsd2siq %xmm4,%rax
+ cvtsi2sdq %rax,%xmm4 # xmm4 = t = (double)((int)(dx / w))
+ and $0x01,%rax # todd = todd = ((int)(dx / w)) & 1
+ mulsd %xmm7,%xmm4 # xmm4 = w*t
+ subsd %xmm4,%xmm2 # xmm2 = dx -= w*t
+ movsd %xmm7,%xmm6 # store w
+ mulsd .L__Zero_Point_Five64(%rip),%xmm7 #xmm7 = 0.5*w
+
+ cmp $0x01,%rax
+ jnz .L__todd_is_even
+ comisd %xmm2,%xmm7
+ je .L__Subtract_w
+
+.L__todd_is_even:
+ comisd %xmm2,%xmm7
+ jnb .L__Dont_Subtract_w
+
+.L__Subtract_w:
+ subsd %xmm6,%xmm2
+
+.L__Dont_Subtract_w:
+ comiss .L__Zero_64(%rip),%xmm0
+ jb .L__Negative
+ cvtsd2ss %xmm2,%xmm0
+ ret
+.L__Negative:
+ movsd .L__MinusZero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ cvtsd2ss %xmm0,%xmm0
+ ret
+
+.align 16
+.L__Input_Is_Equal:
+ cmp $0x7FF,%rax
+ jz .L__Dividend_Is_Infinity
+ cmp $0x7FF,%r8
+ jz .L__InputIsNaNOrInf
+ movsd %xmm0,%xmm1
+ pand .L__sign_bit_32(%rip),%xmm1
+ movss .L__Zero_64(%rip),%xmm0
+ por %xmm1,%xmm0
+ ret
+
+.L__InputIsNaNOrInf:
+ comiss %xmm0,%xmm1
+ jp .L__InputIsNaN
+ ret
+.L__Divisor_Is_Zero:
+.L__InputIsNaN:
+ por .L__exp_mask_32(%rip),%xmm0
+.L__Dividend_Is_Infinity:
+ por .L__QNaN_mask_32(%rip),%xmm0
+ ret
+
+#Case when x < y
+ #xmm2 = dx
+.L__ReturnImmediate:
+ movsd %xmm3,%xmm5
+ mulsd .L__Zero_Point_Five64(%rip), %xmm3 # xmm3 = 0.5*dy
+ comisd %xmm3,%xmm2 # if (dx > 0.5*dy)
+ jna .L__Finish_Immediate # xmm2 <= xmm3
+ subsd %xmm5,%xmm2 #dx -= dy
+
+.L__Finish_Immediate:
+ comiss .L__Zero_64(%rip),%xmm0
+ #xmm0 contains the input and is the result
+ jz .L__Zero
+ ja .L__Positive
+
+ movsd .L__Zero_64(%rip),%xmm0
+ subsd %xmm2,%xmm0
+ cvtsd2ss %xmm0,%xmm0
+ ret
+
+.L__Zero:
+ ret
+
+.L__Positive:
+ cvtsd2ss %xmm2,%xmm0
+ ret
+
+
+
+.align 32
+.L__sign_bit_32: .quad 0x8000000080000000
+ .quad 0x0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0x0
+.L__exp_mask_32: .quad 0x000000007F800000
+ .quad 0x0
+.L__27bit_andingmask_64: .quad 0xfffffffff8000000
+ .quad 0
+.L__2p52_mask_64: .quad 0x4330000000000000
+ .quad 0
+.L__One_64: .quad 0x3FF0000000000000
+ .quad 0
+.L__Zero_64: .quad 0x0
+ .quad 0
+.L__MinusZero_64: .quad 0x8000000000000000
+ .quad 0
+.L__QNaN_mask_32: .quad 0x0000000000400000
+ .quad 0
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+.L__2pminus24_decimal: .quad 0x3E70000000000000
+ .quad 0
+.L__Zero_Point_Five64: .quad 0x3FE0000000000000
+ .quad 0
+
diff --git a/src/gas/round.S b/src/gas/round.S
new file mode 100644
index 0000000..c1ac20a
--- /dev/null
+++ b/src/gas/round.S
@@ -0,0 +1,151 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# round.S
+#
+# An implementation of the round libm function.
+#
+# Prototype:
+#
+# double round(double x);
+#
+
+#
+# Algorithm: First get the exponent of the input
+# double precision number.
+# IF exponent is greater than 51 then return the
+# input as is.
+# IF exponent is less than 0 then force the fraction
+# to be rounded away by adding a large constant
+# (2^52 + 1) and then subtracting the same constant.
+# OTHERWISE (exponent 0 to 51) add 0.5 and
+# shift the mantissa bits based on the exponent
+# value to discard the fractional component.
+#
+#
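+#   A hedged C sketch of the same idea (simplified; the assembly keeps the
+#   pieces in XMM registers instead of reassembling through memory):
+#
+#     #include <math.h>
+#     #include <stdint.h>
+#     #include <string.h>
+#     static double round_sketch(double x)                 /* finite x */
+#     {
+#         uint64_t u;  memcpy(&u, &x, 8);
+#         uint64_t sign = u & 0x8000000000000000ULL;
+#         int e = (int)((u >> 52) & 0x7ff) - 1023;
+#         if (e > 51) return x;                            /* already an integer     */
+#         if (e < 0) {                                     /* |x| < 1: result 0 or 1 */
+#             double t = fabs(x) + 0x1.0000000000001p52;   /* 2^52+1 forces rounding */
+#             t -= 0x1.0000000000001p52;
+#             memcpy(&u, &t, 8); u |= sign; memcpy(&t, &u, 8);
+#             return t;
+#         }
+#         double t = fabs(x) + 0.5;                        /* round half away from zero */
+#         memcpy(&u, &t, 8);
+#         int et = (int)((u >> 52) & 0x7ff) - 1023;        /* exponent of |x| + 0.5     */
+#         u &= ~(0x000fffffffffffffULL >> et);             /* clear the fraction bits   */
+#         u |= sign;
+#         memcpy(&t, &u, 8);
+#         return t;
+#     }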
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(round)
+#define fname_special _round_special
+
+
+# local variable storage offsets
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+#(SSE4.1 provides roundss/roundsd instructions that could do this directly)
+fname:
+ movsd .L__2p52_plus_one(%rip),%xmm4
+ movsd .L__sign_mask_64(%rip),%xmm5
+ mov $52,%r10
+ #take 3 copies of the input xmm0
+ movsd %xmm0,%xmm1
+ movsd %xmm0,%xmm2
+ movsd %xmm0,%xmm3
+        #get the most significant half word of the input number into r9
+ pand .L__exp_mask_64(%rip), %xmm1
+ pextrw $3,%xmm1,%r9
+ cmp $0X7FF0,%r9
+ #Check for infinity inputs
+ jz .L__is_infinity
+ movsd .L__sign_mask_64(%rip), %xmm1
+ pandn %xmm2,%xmm1 # xmm1 now stores the sign of the input number
+        #After shifting r9 right by 4 and subtracting the bias 0x3FF,
+        #r9 holds the unbiased exponent.
+ shr $0X4,%r9
+ sub $0x3FF,%r9
+ cmp $0x00, %r9
+ jl .L__number_less_than_zero
+
+ #IF exponent is greater than 0
+.L__number_greater_than_zero:
+ cmp $51,%r9
+ jg .L__is_greater_than_2p52
+
+ #IF exponent is greater than 0 and less than 2^52
+ pand .L__sign_mask_64(%rip),%xmm0
+ #add with 0.5
+ addsd .L__zero_point_5(%rip),%xmm0
+ movsd %xmm0,%xmm5
+
+ pand .L__exp_mask_64(%rip),%xmm5
+ pand .L__mantissa_mask_64(%rip),%xmm0
+        #r10 = 52 (mantissa length) - r9 (input exponent)
+ sub %r9,%r10
+ movd %r10, %xmm2
+        #do right then left shift by (mantissa length - input exp) to clear the fraction bits
+ psrlq %xmm2,%xmm0
+ psllq %xmm2,%xmm0
+ #OR the input exponent with the input sign
+ por %xmm1,%xmm5
+        #finally OR with the mantissa
+ por %xmm5,%xmm0
+ ret
+
+ #IF exponent is less than 0
+.L__number_less_than_zero:
+ pand %xmm5,%xmm3 # xmm3 =abs(input)
+        addsd   %xmm4,%xmm3     # add (2^52 + 1)
+        subsd   %xmm4,%xmm3     # sub (2^52 + 1)
+ por %xmm1, %xmm3 # OR with the sign of the input number
+ movsd %xmm3,%xmm0
+ ret
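+
+        # Worked example of the trick above (round-to-nearest-even arithmetic;
+        # the +1 makes the base odd, so the 0.5 tie rounds upward as round() requires):
+        #   |x| = 0.3 : 0.3 + (2^52+1) rounds to 2^52+1; subtracting gives 0.0
+        #   |x| = 0.5 : the sum is halfway and ties to the even value 2^52+2, giving 1.0
+        #   |x| = 0.7 : 0.7 + (2^52+1) rounds to 2^52+2; subtracting gives 1.0
+        # The input sign is OR-ed back in afterwards, so round(-0.3) = -0.0.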
+
+ #IF the input is infinity
+.L__is_infinity:
+ comisd %xmm4,%xmm0
+        jnp     .L__is_zero      # PF clear => input is +/-Inf, return it as is
+        #IF the input is a NaN
+.L__is_nan :
+ por .L__qnan_mask_64(%rip),%xmm0 # set the QNan Bit
+.L__is_zero :
+.L__is_greater_than_2p52:
+ ret
+
+.align 16
+.L__sign_mask_64: .quad 0x7FFFFFFFFFFFFFFF
+ .quad 0
+
+.L__qnan_mask_64: .quad 0x0008000000000000
+ .quad 0
+.L__exp_mask_64: .quad 0x7FF0000000000000
+ .quad 0
+.L__mantissa_mask_64: .quad 0x000FFFFFFFFFFFFF
+ .quad 0
+.L__zero: .quad 0x0000000000000000
+ .quad 0
+.L__2p52_plus_one: .quad 0x4330000000000001 # = 4503599627370497.0
+ .quad 0
+.L__zero_point_5:      .quad 0x3FE0000000000001        # = 0.5 (plus one ulp)
+ .quad 0
+
+
+
diff --git a/src/gas/sin.S b/src/gas/sin.S
new file mode 100644
index 0000000..378e103
--- /dev/null
+++ b/src/gas/sin.S
@@ -0,0 +1,481 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sin function.
+#
+# Prototype:
+#
+# double sin(double x);
+#
+# Computes sin(x).
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0 # for alignment
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_411E848000000000:     .quad 0x415312d000000000        # 5e6 (label retains the old 5e5 constant 0x411E848000000000)
+ .quad 0
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff # Sign bit zero
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+
+.align 32
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0
+ .quad 0x03EFA01A019F4EC91 # 2.48016e-005 c3
+ .quad 0
+ .quad 0x0bE927E4FA17F667B # -2.75573e-007 c4
+ .quad 0
+ .quad 0x03E21EEB690382EEC # 2.08761e-009 c5
+ .quad 0
+ .quad 0x0bDA907DB47258AA7 # -1.13826e-011 c6
+ .quad 0
+
+.align 32
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0
+
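+# Hedged C sketch of the polynomial kernels these tables feed, for a reduced
+# argument r in roughly [-pi/4, pi/4] (the assembly additionally carries an
+# extra-precision tail rr that this sketch ignores):
+#
+#     /* s1..s6 and c1..c6 stand for the .Lsinarray / .Lcosarray constants above */
+#     extern const double s1, s2, s3, s4, s5, s6, c1, c2, c3, c4, c5, c6;
+#     static double sin_kernel(double r)
+#     {
+#         double x2 = r * r;
+#         double zs = s1 + x2*(s2 + x2*(s3 + x2*(s4 + x2*(s5 + x2*s6))));
+#         return r + r*x2*zs;                    /* r + r^3 * poly(r^2)    */
+#     }
+#     static double cos_kernel(double r)
+#     {
+#         double x2 = r * r;
+#         double zc = c1 + x2*(c2 + x2*(c3 + x2*(c4 + x2*(c5 + x2*c6))));
+#         return (1.0 - 0.5*x2) + x2*x2*zc;      /* 1 - r^2/2 + r^4 * poly */
+#     }
+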
+.text
+.align 32
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sin)
+#define fname_special _sin_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2
+.equ rr, 0x60 # pointer to rr for amd_remainder_piby2
+.equ region, 0x70 # pointer to region for amd_remainder_piby2
+.equ stack_size, 0x98
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+ xorpd %xmm2, %xmm2 # zeroed out for later use
+
+# GET_BITS_DP64(x, ux);
+# get the input value to an integer register.
+ movsd %xmm0, p_temp(%rsp)
+ mov p_temp(%rsp), %rdx # rdx is ux
+
+## if NaN or inf
+ mov $0x07ff0000000000000, %rax
+ mov %rax, %r10
+ and %rdx, %r10
+ cmp %rax, %r10
+ jz .Lsin_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff, %r10
+ and %rdx, %r10 # r10 is ax
+ mov $1, %r8d # for determining region later on
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18, %rax
+ cmp %rax, %r10
+ jg .Lsin_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x03f20000000000000, %rax
+ cmp %rax, %r10
+ jge .Lsin_small
+
+## if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */
+ mov $0x03e40000000000000, %rax
+ cmp %rax, %r10
+ jge .Lsin_smaller
+
+# sin = x;  (xmm0 already holds x)
+ jmp .Lsin_cleanup
+
+.align 32
+.Lsin_smaller:
+# sin = x - x^3 * 0.1666666666666666666;
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fc5555555555555(%rip), %xmm4 # 0.1666666666666666666
+ mulsd %xmm2, %xmm2 # x^2
+ mulsd %xmm0, %xmm2 # x^3
+ mulsd %xmm4, %xmm2 # x^3 * 0.1666666666666666666
+ subsd %xmm2, %xmm0 # x - x^3 * 0.1666666666666666666
+ jmp .Lsin_cleanup
+
+.align 32
+.Lsin_small:
+# sin = sin_piby4(x, 0.0);
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+
+.Lsin_piby4_noreduce:
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a sin calculation
+# zs = (s2 + x2 * (s3 + x2 * (s4 + x2 * (s5 + x2 * s6))));
+ movsd .Lsinarray+0x50(%rip), %xmm3 # s6
+ mulsd %xmm2, %xmm3 # x2s6
+ movsd .Lsinarray+0x20(%rip), %xmm5 # s3
+ movsd %xmm2, %xmm1 # move for x4
+ mulsd %xmm2, %xmm1 # x4
+ mulsd %xmm2, %xmm5 # x2s3
+ movsd %xmm0, %xmm4 # move for x3
+ addsd .Lsinarray+0x40(%rip), %xmm3 # s5+x2s6
+ mulsd %xmm2, %xmm1 # x6
+ mulsd %xmm2, %xmm3 # x2(s5+x2s6)
+ mulsd %xmm2, %xmm4 # x3
+ addsd .Lsinarray+0x10(%rip), %xmm5 # s2+x2s3
+ mulsd %xmm2, %xmm5 # x2(s2+x2s3)
+ addsd .Lsinarray+0x30(%rip), %xmm3 # s4 + x2(s5+x2s6)
+ mulsd %xmm1, %xmm3 # x6(s4 + x2(s5+x2s6))
+ addsd .Lsinarray(%rip), %xmm5 # s1+x2(s2+x2s3)
+ addsd %xmm5, %xmm3 # zs
+ mulsd %xmm3, %xmm4 # *x3
+ addsd %xmm4, %xmm0 # +x
+ jmp .Lsin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsin_reduce:
+# xneg = (ax != ux);
+ cmp %r10, %rdx
+ mov $0, %r11d
+
+## if (xneg) x = -x;
+ jz .Lpositive
+ mov $1, %r11d
+ subsd %xmm0, %xmm2
+ movsd %xmm2, %xmm0
+
+.align 16
+.Lpositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip), %r10
+ jae .Lsin_reduce_precise
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
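+# A hedged C sketch of this Cody-Waite style reduction (x is non-negative
+# here; the extra-precision correction applied when expdiff > 15 is omitted):
+#
+#     extern const double piby2_1, piby2_1tail;          /* the .quad constants above */
+#     const double twobypi = 6.36619772367581382433e-01; /* 2/pi                      */
+#     int    npi2  = (int)(x * twobypi + 0.5);           /* nearest multiple of pi/2  */
+#     double rhead = x - npi2 * piby2_1;                 /* head of x - npi2*(pi/2)   */
+#     double rtail = npi2 * piby2_1tail;
+#     double r     = rhead - rtail;                      /* reduced arg, ~[-pi/4,pi/4] */
+#     double rr    = (rhead - r) - rtail;                /* low-order tail of r        */
+#     int    region = npi2 & 3;                          /* quadrant: picks sin/cos and sign */
+#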
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fe45f306dc9c883(%rip), %xmm3 # twobypi
+ movsd %xmm0, %xmm4
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+ mulsd %xmm3, %xmm2
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10, %r9
+ shr $52, %r9 # >>EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5, %xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1
+ cvttpd2dq %xmm2, %xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail
+ cvtdq2pd %xmm0, %xmm2 # and back to float.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+ mulsd %xmm2, %xmm3
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2, %xmm1
+ movd %xmm0, %eax
+
+# GET_BITS_DP64(rhead-rtail, uy);
+ movsd %xmm4, %xmm0
+ subsd %xmm1, %xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2
+ movsd %xmm0,p_temp(%rsp)
+ movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail
+ mov p_temp(%rsp), %rcx # rcx is rhead-rtail
+
+# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1, %rcx # strip any sign bit
+ shr $53, %rcx # >> EXPSHIFTBITS_DP64 +1
+ sub %rcx, %r9 # expdiff
+
+## if (expdiff > 15)
+ cmp $15, %r9
+ jle .Lexplediff15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4, %xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2, %xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2, %xmm5 # npi2 * piby2_2tail
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4, %xmm1 # t - rhead
+ subsd %xmm3, %xmm1 # -rtail
+ subsd %xmm1, %xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4, %xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5, %xmm1
+ subsd %xmm5, %xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexplediff15:
+# region = npi2 & 3;
+
+ subsd %xmm0, %xmm4 # rhead-r
+ subsd %xmm1, %xmm4 # rr = (rhead-r) - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+## if the input was close to a pi/2 multiple
+# The original NAG code missed this trick. If the input is very close to n*pi/2 after
+# reduction,
+# then the sin is ~ 1.0 , to within 53 bits, when r is < 2^-27. We already
+# have x at this point, so we can skip the sin polynomials.
+
+ cmp $0x03f2, %rcx # if r small.
+ jge .Lsin_piby4 # use taylor series if not
+ cmp $0x03de, %rcx # if r really small.
+        jle     .Lr_small                       # then sin(r) ~= r and cos(r) ~= 1
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 # x^2
+
+## if region is 0 or 2 do a sin calc.
+ and %eax, %r8d
+ jnz .Lcossmall
+
+# region 0 or 2 do a sin calculation
+# use simply polynomial
+# x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip), %xmm3
+ mulsd %xmm0, %xmm3 # * x
+ mulsd %xmm2, %xmm3 # * x^2
+ subsd %xmm3, %xmm0 # xs
+ jmp .Ladjust_region
+
+.align 16
+.Lcossmall:
+# region 1 or 3 do a cos calculation
+# use simply polynomial
+# 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2
+ subsd %xmm2, %xmm0 # xc
+ jmp .Ladjust_region
+
+.align 16
+.Lr_small:
+## if region is 1 or 3 do a cos calc.
+ and %eax, %r8d
+ jz .Ladjust_region
+
+# odd
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # cos(r) is a 1
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 32
+.Lsin_reduce_precise:
+# // Reduce x into range [-pi/4,pi/4]
+#    __amd_remainder_piby2(x, &r, &rr, &region);
+
+ mov %r11,p_temp(%rsp)
+ lea region(%rsp), %rdx
+ lea rr(%rsp), %rsi
+ lea r(%rsp), %rdi
+
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp), %r11
+ mov $1, %r8d # for determining region later on
+ movsd r(%rsp), %xmm0 # x
+ movsd rr(%rsp), %xmm4 # xx
+ mov region(%rsp), %eax # region
+
+# xmm0 = x, xmm4 = xx, r8d = 1, eax= region
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+# perform taylor series to calc sin(x) or cos(x) of the reduced argument
+.Lsin_piby4:
+# x2 = r * r;
+
+#xmm4 = a part of rr for the sin path, xmm4 is overwritten in the sin path
+#instead use xmm3 because that was freed up in the sin path, xmm3 is overwritten in sin path
+ movsd %xmm0, %xmm3
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+## if region is 0 or 2 do a sin calc.
+ and %eax, %r8d
+ jnz .Lcosregion
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 do a sin calculation
+ movsd .Lsinarray+0x50(%rip), %xmm3 # s6
+ mulsd %xmm2, %xmm3 # x2s6
+ movsd .Lsinarray+0x20(%rip), %xmm5 # s3
+ movsd %xmm4,p_temp(%rsp) # store xx
+ movsd %xmm2, %xmm1 # move for x4
+ mulsd %xmm2, %xmm1 # x4
+ movsd %xmm0,p_temp1(%rsp) # store x
+ mulsd %xmm2, %xmm5 # x2s3
+ movsd %xmm0, %xmm4 # move for x3
+ addsd .Lsinarray+0x40(%rip), %xmm3 # s5+x2s6
+ mulsd %xmm2, %xmm1 # x6
+ mulsd %xmm2, %xmm3 # x2(s5+x2s6)
+ mulsd %xmm2, %xmm4 # x3
+ addsd .Lsinarray+0x10(%rip), %xmm5 # s2+x2s3
+ mulsd %xmm2, %xmm5 # x2(s2+x2s3)
+ addsd .Lsinarray+0x30(%rip), %xmm3 # s4 + x2(s5+x2s6)
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x2
+ movsd p_temp(%rsp), %xmm0 # load xx
+ mulsd %xmm1, %xmm3 # x6(s4 + x2(s5+x2s6))
+ addsd .Lsinarray(%rip), %xmm5 # s1+x2(s2+x2s3)
+ mulsd %xmm0, %xmm2 # 0.5 * x2 *xx
+ addsd %xmm5, %xmm3 # zs
+ mulsd %xmm3, %xmm4 # *x3
+ subsd %xmm2, %xmm4 # x3*zs - 0.5 * x2 *xx
+ addsd %xmm4, %xmm0 # +xx
+ addsd p_temp1(%rsp), %xmm0 # +x
+ jmp .Ladjust_region
+
+.align 16
+.Lcosregion:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 1 or 3 - do a cos calculation
+# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6))));
+ mulsd %xmm0, %xmm4 # x*xx
+ movsd .L__real_3fe0000000000000(%rip), %xmm5
+ movsd .Lcosarray+0x50(%rip), %xmm1 # c6
+ movsd .Lcosarray+0x20(%rip), %xmm0 # c3
+ mulsd %xmm2, %xmm5 # r = 0.5 *x2
+ movsd %xmm2, %xmm3 # copy of x2
+ movsd %xmm4,p_temp(%rsp) # store x*xx
+ mulsd %xmm2, %xmm1 # c6*x2
+ mulsd %xmm2, %xmm0 # c3*x2
+ subsd .L__real_3ff0000000000000(%rip), %xmm5 # -t=r-1.0 ;trash r
+ mulsd %xmm2, %xmm3 # x4
+ addsd .Lcosarray+0x40(%rip), %xmm1 # c5+x2c6
+ addsd .Lcosarray+0x10(%rip), %xmm0 # c2+x2C3
+ addsd .L__real_3ff0000000000000(%rip), %xmm5 # 1 + (-t) ;trash t
+ mulsd %xmm2, %xmm3 # x6
+ mulsd %xmm2, %xmm1 # x2(c5+x2c6)
+ mulsd %xmm2, %xmm0 # x2(c2+x2C3)
+ movsd %xmm2, %xmm4 # copy of x2
+ mulsd .L__real_3fe0000000000000(%rip), %xmm4 # r recalculate
+ addsd .Lcosarray+0x30(%rip), %xmm1 # c4 + x2(c5+x2c6)
+ addsd .Lcosarray(%rip), %xmm0 # c1+x2(c2+x2C3)
+ mulsd %xmm2, %xmm2 # x4 recalculate
+ subsd %xmm4, %xmm5 # (1 + (-t)) - r
+ mulsd %xmm3, %xmm1 # x6(c4 + x2(c5+x2c6))
+ addsd %xmm1, %xmm0 # zc
+        subsd   .L__real_3ff0000000000000(%rip), %xmm4          # -t recalculated (r - 1.0)
+ subsd p_temp(%rsp), %xmm5 # ((1 + (-t)) - r) - x*xx
+ mulsd %xmm2, %xmm0 # x4 * zc
+ addsd %xmm5, %xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx)
+ subsd %xmm4, %xmm0 # result - (-t)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.Ladjust_region: # positive or negative
+# switch (region)
+ shr $1, %eax
+ mov %eax, %ecx
+ and %r11d, %eax
+ not %ecx
+ not %r11d
+ and %r11d, %ecx
+ or %ecx, %eax
+ and $1, %eax
+ jnz .Lsin_cleanup
+
+## if the original region 0, 1 and arg is negative, then we negate the result.
+## if the original region 2, 3 and arg is positive, then we negate the result.
+ movsd %xmm0, %xmm2
+ xorpd %xmm0, %xmm0
+ subsd %xmm2, %xmm0
+
+.align 16
+.Lsin_cleanup:
+ add $stack_size, %rsp
+ ret
+
+.align 16
+.Lsin_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
+
+
diff --git a/src/gas/sincos.S b/src/gas/sincos.S
new file mode 100644
index 0000000..6558f9e
--- /dev/null
+++ b/src/gas/sincos.S
@@ -0,0 +1,616 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sincos function.
+#
+# Prototype:
+#
+# void sincos(double x, double* sinr, double* cosr);
+#
+# Computes sin(x) and cos(x), returned through sinr and cosr.
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0 # for alignment
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0
+.L__real_411E848000000000:     .quad 0x415312d000000000        # 5e6 (label retains the old 5e5 constant 0x411E848000000000)
+ .quad 0
+
+.align 16
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.16666666666666666 s1
+ .quad 0x03fa5555555555555 # 0.041666666666666664 c1
+ .quad 0x03f81111111110bb3 # 0.00833333333333095 s2
+ .quad 0x0bf56c16c16c16967 # -0.0013888888888887398 c2
+ .quad 0x0bf2a01a019e83e5c # -0.00019841269836761127 s3
+ .quad 0x03efa01a019f4ec90 # 2.4801587298767041E-05 c3
+ .quad 0x03ec71de3796cde01 # 2.7557316103728802E-06 s4
+ .quad 0x0be927e4fa17f65f6 # -2.7557317272344188E-07 c4
+ .quad 0x0be5ae600b42fdfa7 # -2.5051132068021698E-08 s5
+        .quad   0x03e21eeb69037ab78             # 2.0876146382232963E-09 c5
+        .quad   0x03de5e0b2f9a43bb8             # 1.5918144304485914E-10 s6
+        .quad   0x0bda907db46cc5e42             # -1.1382639806794487E-11 c6
+
+.align 16
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+        .quad   0x0bf56c16c16c16967             # -0.00138889 c2
+        .quad   0x03f81111111110bb3             # 0.00833333 s2
+        .quad   0x03efa01a019f4ec90             # 2.48016e-005 c3
+        .quad   0x0bf2a01a019e83e5c             # -0.000198413 s3
+        .quad   0x0be927e4fa17f65f6             # -2.75573e-007 c4
+        .quad   0x03ec71de3796cde01             # 2.75573e-006 s4
+        .quad   0x03e21eeb69037ab78             # 2.08761e-009 c5
+        .quad   0x0be5ae600b42fdfa7             # -2.50511e-008 s5
+        .quad   0x0bda907db46cc5e42             # -1.13826e-011 c6
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
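+# The two tables above interleave the sin and cos coefficients so that a
+# single packed Horner pass (mulpd/addpd on {sin, cos} pairs) evaluates both
+# polynomials at once. A hedged scalar C sketch of the idea:
+#
+#     extern const double coeff[12];       /* assumed: the interleaved table above */
+#     double x2 = r * r, z[2];
+#     for (int k = 0; k < 2; ++k) {        /* k = 0: sin stream, k = 1: cos stream */
+#         const double *c = &coeff[k];
+#         z[k] = c[10];                                /* s6 / c6                 */
+#         for (int i = 8; i >= 0; i -= 2)
+#             z[k] = z[k] * x2 + c[i];                 /* Horner step             */
+#     }
+#     double sinr = r + r*x2*z[0];                     /* r + r^3 * zs            */
+#     double cosr = (1.0 - 0.5*x2) + x2*x2*z[1];       /* 1 - r^2/2 + r^4 * zc    */
+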
+.text
+.align 16
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sincos)
+#define fname_special _sincos_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2
+.equ rr, 0x60 # pointer to rr for amd_remainder_piby2
+.equ region, 0x70 # pointer to region for amd_remainder_piby2
+.equ stack_size, 0x98
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+ xorpd %xmm2,%xmm2 # zeroed out for later use
+
+# GET_BITS_DP64(x, ux);
+# get the input value to an integer register.
+ movsd %xmm0,p_temp(%rsp)
+ mov p_temp(%rsp),%rcx # rcx is ux
+
+## if NaN or inf
+ mov $0x07ff0000000000000,%rax
+ mov %rax,%r10
+ and %rcx,%r10
+ cmp %rax,%r10
+ jz .Lsincos_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff,%r10
+ and %rcx,%r10 # r10 is ax
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18,%rax
+ cmp %rax,%r10
+ jg .Lsincos_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x03f20000000000000,%rax
+ cmp %rax,%r10
+ jge .Lsincos_small
+
+## if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */
+ mov $0x03e40000000000000,%rax
+ cmp %rax,%r10
+ jge .Lsincos_smaller
+
+ # sin = x;
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # cos = 1.0;
+ jmp .Lsincos_cleanup
+
+## else
+.align 32
+.Lsincos_smaller:
+# sin = x - x^3 * 0.1666666666666666666;
+# cos = 1.0 - x*x*0.5;
+
+ movsd %xmm0,%xmm2
+ movsd .L__real_3fc5555555555555(%rip),%xmm4 # 0.1666666666666666666
+ mulsd %xmm2,%xmm2 # x^2
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ movsd %xmm2,%xmm3 # copy of x^2
+
+ mulsd %xmm0,%xmm2 # x^3
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 * x^2
+ mulsd %xmm4,%xmm2 # x^3 * 0.1666666666666666666
+ subsd %xmm2,%xmm0 # x - x^3 * 0.1666666666666666666, sin
+ subsd %xmm3,%xmm1 # 1 - 0.5 * x^2, cos
+
+ jmp .Lsincos_cleanup
+
+
+## else
+
+.align 16
+.Lsincos_small:
+# sin = sin_piby4(x, 0.0);
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+
+# x2 = r * r;
+ movsd %xmm0,%xmm2
+ mulsd %xmm0,%xmm2 # x2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a sin calculation
+# zs = (s2 + x2 * (s3 + x2 * (s4 + x2 * (s5 + x2 * s6))));
+
+ movlhps %xmm2,%xmm2
+ movapd .Lsincosarray+0x50(%rip),%xmm3 # s6
+ movapd %xmm2,%xmm1 # move for x4
+ movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3
+ mulpd %xmm2,%xmm3 # x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6
+ mulpd %xmm2,%xmm5 # x2s3
+ movapd %xmm4,p_temp(%rsp) # rr move to memory
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3
+ movapd %xmm1,%xmm4 # move for x6
+ addpd .Lsincosarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ mulpd %xmm2,%xmm4 # x6
+ addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+
+ movsd %xmm2,%xmm4 # xmm4 = x2 for 0.5x2 for cos
+ # xmm2 contains x2 for x3 for sin
+ addpd %xmm5,%xmm3 # zs in lower and zc upper
+
+ mulsd %xmm0,%xmm2 # xmm2=x3 for sin
+
+ movhlps %xmm3,%xmm5 # Copy z, xmm5 = cos , xmm3 = sin
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm4 # xmm4=r=0.5*x2 for cos term
+ mulsd %xmm2,%xmm3 # sin *x3
+ mulsd %xmm1,%xmm5 # cos *x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1.0
+ subsd %xmm4,%xmm2 # t=1.0-r
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ subsd %xmm2,%xmm1 # 1 - t
+ subsd %xmm4,%xmm1 # (1-t) -r
+ addsd %xmm5,%xmm1 # ((1-t) -r) + cos
+ addsd %xmm3,%xmm0 # xmm0= sin+x, final sin term
+ addsd %xmm2,%xmm1 # xmm1 = t +{ ((1-t) -r) + cos}, final cos term
+
+ jmp .Lsincos_cleanup
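+
+# For reference, a C sketch of the tiered small-argument handling above
+# (a reading of the three code paths, not part of the original source;
+# sin_piby4/cos_piby4 are the polynomial evaluations named in the comments):
+#
+#   if (fabs(x) < 0x1.0p-27) {        /* sin(x) ~ x, cos(x) ~ 1        */
+#       sin = x;  cos = 1.0;
+#   } else if (fabs(x) < 0x1.0p-13) { /* one correction term suffices  */
+#       sin = x - x*x*x * 0.166666666666666666;
+#       cos = 1.0 - x*x * 0.5;
+#   } else {                          /* 2^-13 <= |x| <= pi/4          */
+#       sin = sin_piby4(x, 0.0);
+#       cos = cos_piby4(x, 0.0);
+#   }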
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsincos_reduce:
+# change rdx to rcx and r8 to r9
+# rcx= ux, r10 = ax
+# %r9,%rax are free
+
+# xneg = (ax != ux);
+ cmp %r10,%rcx
+ mov $0,%r11d
+
+## if (xneg) x = -x;
+ jz .LPositive
+ mov $1,%r11d
+ subsd %xmm0,%xmm2
+ movsd %xmm2,%xmm0
+
+# rcx= ux, r10 = ax, r11= Sign
+# %r9,%rax are free
+# change rdx to rcx and r8 to r9
+
+.align 16
+.LPositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip),%r10
+ jae .Lsincos_reduce_precise
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+ movsd %xmm0,%xmm2
+ movsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # twobypi
+ movsd %xmm0,%xmm4
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+ mulsd %xmm3,%xmm2
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ shr $52,%r10 # >>EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5,%xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1
+ cvttpd2dq %xmm2,%xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip),%xmm1 # piby2_1tail
+ cvtdq2pd %xmm0,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3
+ subsd %xmm3,%xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2,%xmm1
+ movd %xmm0,%eax
+
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm3 # piby2_2
+ movsd %xmm0,p_temp(%rsp)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm5 # piby2_2tail
+ mov %eax,%ecx
+ mov p_temp(%rsp),%r9 # r9 is rhead-rtail
+
+# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1,%r9 # strip any sign bit
+ shr $53,%r9 # >> EXPSHIFTBITS_DP64 +1
+ sub %r9,%r10 # expdiff
+
+## if (expdiff > 15)
+ cmp $15,%r10
+ jle .Lexpdiff15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4,%xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2,%xmm5 # npi2 * piby2_2tail
+ subsd %xmm3,%xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4,%xmm1 # t - rhead
+ subsd %xmm3,%xmm1 # -rtail
+ subsd %xmm1,%xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4,%xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5,%xmm1
+ subsd %xmm5,%xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiff15:
+# region = npi2 & 3;
+
+ subsd %xmm0,%xmm4 # rhead-r
+ subsd %xmm1,%xmm4 # rr = (rhead-r) - rtail
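+
+# A compact C sketch of the Cody-Waite style reduction performed above
+# (assembled from the comments in this block; r and rr feed the polynomials,
+# region selects the quadrant):
+#
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;
+#   rtail = npi2 * piby2_1tail;
+#   if (xexp - exponent(rhead - rtail) > 15) {
+#       /* x is very close to a multiple of pi/2: refine with piby2_2/piby2_2tail */
+#       t     = rhead;
+#       rtail = npi2 * piby2_2;
+#       rhead = t - rtail;
+#       rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   }
+#   r      = rhead - rtail;
+#   rr     = (rhead - r) - rtail;
+#   region = npi2 & 3;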
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+## if the input was close to a pi/2 multiple
+# The original NAG code missed this trick. If the input is very close to n*pi/2
+# after reduction, then cos(r) is ~1.0 and sin(r) is ~r, to within 53 bits, when
+# r is < 2^-27. We already have r at this point, so we can skip the sin/cos
+# polynomials.
+
+ cmp $0x03f2,%r9 # if r small.
+ jge .Lcossin_piby4 # use taylor series if not
+ cmp $0x03de,%r9 # if r really small.
+ jle .Lr_small # then sin(r) = r, cos(r) = 1
+
+ movsd %xmm0,%xmm2
+ mulsd %xmm2,%xmm2 # x^2
+
+## if region is 0 or 2 do a sin calc.
+ and $1,%ecx
+ jnz .Lregion13
+
+# region 0 or 2 do a sincos calculation
+# use a simple polynomial
+# sin=x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip),%xmm3 # 0.166666666
+ mulsd %xmm0,%xmm3 # * x
+ mulsd %xmm2,%xmm3 # * x^2
+ subsd %xmm3,%xmm0 # xs
+# cos=1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x^2
+ subsd %xmm2,%xmm1 # xc
+
+ jmp .Ladjust_region
+
+.align 16
+.Lregion13:
+# region 1 or 3 do a cossin calculation
+# use a simple polynomial
+# sin=x - x*x*x*0.166666666666666666;
+ movsd %xmm0,%xmm1
+
+ movsd .L__real_3fc5555555555555(%rip),%xmm3 # 0.166666666
+ mulsd %xmm0,%xmm3 # 0.166666666* x
+ mulsd %xmm2,%xmm3 # 0.166666666* x * x^2
+ subsd %xmm3,%xmm1 # xs
+# cos=1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip),%xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x^2
+ subsd %xmm2,%xmm0 # xc
+
+ jmp .Ladjust_region
+
+.align 16
+.Lr_small:
+## if region is 0 or 2 do a sincos calc.
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # cos(r) is a 1
+ and $1,%ecx
+ jz .Ladjust_region
+
+## if region is 1 or 3 do a cossin calc.
+ movsd %xmm0,%xmm1 # sin(r) is r
+ movsd .L__real_3ff0000000000000(%rip),%xmm0 # cos(r) is a 1
+ jmp .Ladjust_region
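+
+# The shortcut above as C (a sketch of the branch structure only; the
+# 0x3f2 / 0x3de compares test the biased exponent of r):
+#
+#   if (|r| >= 2^-13)        { /* use the full polynomials (.Lcossin_piby4) */   }
+#   else if (|r| very small) { sin(r) = r;               cos(r) = 1.0;           }
+#   else                     { sin(r) = r - r*r*r/6.0;   cos(r) = 1.0 - 0.5*r*r; }
+#   /* for regions 1 and 3 the two results are swapped before the sign fix-up */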
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsincos_reduce_precise:
+# // Reduce x into range [-pi/4,pi/4]
+# __amd_remainder_piby2(x, &r, &rr, &region);
+
+ mov %rdi, p_temp1(%rsp)
+ mov %rsi, p_temp1+8(%rsp)
+ mov %r11,p_temp(%rsp)
+
+ lea region(%rsp),%rdx
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp1(%rsp), %rdi
+ mov p_temp1+8(%rsp), %rsi
+ mov p_temp(%rsp),%r11
+
+ movsd r(%rsp),%xmm0 # x
+ movsd rr(%rsp),%xmm4 # xx
+ mov region(%rsp),%eax # region to classify for sin/cos calc
+ mov %eax,%ecx # region to get sign
+
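+# The helper called above does the slow-path reduction; conceptually
+# (a sketch of its contract as used here, not the helper's own code):
+#
+#   __amd_remainder_piby2(x, &r, &rr, &region);
+#   /* on return r (+ rr as a tail) is x reduced into [-pi/4,pi/4] and
+#      region is the quadrant count, so the fast-path polynomials apply */
+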
+# xmm0 = x, xmm4 = xx, r8d = 1, eax= region
+.align 16
+.Lcossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# perform taylor series to calc sinx, cosx
+# x2 = r * r;
+#xmm4 = a part of rr for the sin path, xmm4 is overwritten in the sin path
+#instead use xmm3 because that was freed up in the sin path, xmm3 is overwritten in sin path
+
+ movsd %xmm0,%xmm2
+ mulsd %xmm0,%xmm2 #x2
+
+## if region is 0 or 2 do a sincos calc.
+ and $1,%ecx
+ jz .Lsincos02
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 1 or 3 - do a cossin calculation
+# zc = (c2 + x2 * (c3 + x2 * (c4 + x2 * (c5 + x2 * c6))));
+
+
+ movlhps %xmm2,%xmm2
+
+ movapd .Lcossinarray+0x50(%rip),%xmm3 # s6
+ movapd %xmm2,%xmm1 # move for x4
+ movdqa .Lcossinarray+0x20(%rip),%xmm5 # s3
+ mulpd %xmm2,%xmm3 # x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm3 # s5+x2s6
+ mulpd %xmm2,%xmm5 # x2s3
+ movsd %xmm4,p_temp(%rsp) # rr move to memory
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ addpd .Lcossinarray+0x10(%rip),%xmm5 # s2+x2s3
+ movapd %xmm1,%xmm4 # move for x6
+ addpd .Lcossinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ mulpd %xmm2,%xmm4 # x6
+ addpd .Lcossinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+
+ movsd %xmm2,%xmm4 # xmm4 = x2 for 0.5x2 cos
+ # xmm2 contains x2 for x3 sin
+
+ addpd %xmm5,%xmm3 # zc in lower and zs in upper
+
+ mulsd %xmm0,%xmm2 # xmm2=x3 for the sin term
+
+ movhlps %xmm3,%xmm5 # Copy z, xmm5 = sin, xmm3 = cos
+ mulsd .L__real_3fe0000000000000(%rip),%xmm4 # xmm4=r=0.5*x2 for cos term
+
+ mulsd %xmm2,%xmm5 # sin *x3
+ mulsd %xmm1,%xmm3 # cos *x4
+ movsd %xmm0,p_temp1(%rsp) # store x
+ movsd %xmm0,%xmm1
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1.0
+ subsd %xmm4,%xmm2 # t=1.0-r
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm0 # 1.0
+ subsd %xmm2,%xmm0 # 1 - t
+
+ mulsd p_temp(%rsp),%xmm1 # x*xx
+ subsd %xmm4,%xmm0 # (1-t) -r
+ subsd %xmm1,%xmm0 # ((1-t) -r) - x *xx
+
+ mulsd p_temp(%rsp),%xmm4 # 0.5*x2*xx
+
+ addsd %xmm3,%xmm0 # (((1-t) -r) - x *xx) + cos
+
+ subsd %xmm4,%xmm5 # sin - 0.5*x2*xx
+
+ addsd %xmm2,%xmm0 # xmm0 = t +{ (((1-t) -r) - x *xx) + cos}, final cos term
+
+ addsd p_temp(%rsp),%xmm5 # sin + xx
+ movsd p_temp1(%rsp),%xmm1 # load x
+ addsd %xmm5,%xmm1 # xmm1= sin+x, final sin term
+
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsincos02:
+# region 0 or 2 do a sincos calculation
+ movlhps %xmm2,%xmm2
+
+ movapd .Lsincosarray+0x50(%rip),%xmm3 # s6
+ movapd %xmm2,%xmm1 # move for x4
+ movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3
+ mulpd %xmm2,%xmm3 # x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6
+ mulpd %xmm2,%xmm5 # x2s3
+ movsd %xmm4,p_temp(%rsp) # rr move to memory
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3
+ movapd %xmm1,%xmm4 # move for x6
+ addpd .Lsincosarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ mulpd %xmm2,%xmm4 # x6
+ addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+
+ movsd %xmm2,%xmm4 # xmm4 = x2 for 0.5x2 for cos
+ # xmm2 contains x2 for x3 for sin
+
+ addpd %xmm5,%xmm3 # zs in lower and zc in upper
+
+ mulsd %xmm0,%xmm2 # xmm2=x3 for sin
+
+ movhlps %xmm3,%xmm5 # Copy z, xmm5 = cos , xmm3 = sin
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm4 # xmm4=r=0.5*x2 for cos term
+
+ mulsd %xmm2,%xmm3 # sin *x3
+ mulsd %xmm1,%xmm5 # cos *x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1.0
+ subsd %xmm4,%xmm2 # t=1.0-r
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ subsd %xmm2,%xmm1 # 1 - t
+
+ movsd %xmm0,p_temp1(%rsp) # store x
+ mulsd p_temp(%rsp),%xmm0 # x*xx
+
+ subsd %xmm4,%xmm1 # (1-t) -r
+ subsd %xmm0,%xmm1 # ((1-t) -r) - x *xx
+
+ mulsd p_temp(%rsp),%xmm4 # 0.5*x2*xx
+
+ addsd %xmm5,%xmm1 # (((1-t) -r) - x *xx) + cos
+
+ subsd %xmm4,%xmm3 # sin - 0.5*x2*xx
+
+ addsd %xmm2,%xmm1 # xmm1 = t +{ (((1-t) -r) - x *xx) + cos}, final cos term
+
+ addsd p_temp(%rsp),%xmm3 # sin + xx
+ movsd p_temp1(%rsp),%xmm0 # load x
+ addsd %xmm3,%xmm0 # xmm0= sin+x, final sin term
+
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# switch (region)
+.align 16
+.Ladjust_region: # positive or negative for sin return val in xmm0
+
+ mov %eax,%r9d
+
+ shr $1,%eax
+ mov %eax,%ecx
+ and %r11d,%eax
+
+ not %ecx
+ not %r11d
+ and %r11d,%ecx
+
+ or %ecx,%eax
+ and $1,%eax
+ jnz .Lcos_sign
+
+## if the original region 0, 1 and arg is negative, then we negate the result.
+## if the original region 2, 3 and arg is positive, then we negate the result.
+ movsd %xmm0,%xmm2
+ xorpd %xmm0,%xmm0
+ subsd %xmm2,%xmm0
+
+.Lcos_sign: # positive or negative for cos return val in xmm1
+ add $1,%r9
+ and $2,%r9d
+ jz .Lsincos_cleanup
+## if the original region 1 or 2 then we negate the result.
+ movsd %xmm1,%xmm2
+ xorpd %xmm1,%xmm1
+ subsd %xmm2,%xmm1
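+
+# The sign fix-up above in C (a sketch; region is the quadrant from the
+# reduction and xneg (r11) is 1 when the original argument was negative):
+#
+#   if ((((region >> 1) ^ xneg) & 1) != 0)   /* regions 0,1 with x < 0, or  */
+#       sin = -sin;                          /* regions 2,3 with x >= 0     */
+#   if (((region + 1) & 2) != 0)             /* regions 1 and 2             */
+#       cos = -cos;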
+
+#.align 16
+.Lsincos_cleanup:
+ movsd %xmm0, (%rdi) # save the sin
+ movsd %xmm1, (%rsi) # save the cos
+
+ add $stack_size,%rsp
+ ret
+
+.align 16
+.Lsincos_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
diff --git a/src/gas/sincosf.S b/src/gas/sincosf.S
new file mode 100644
index 0000000..dcdbe9a
--- /dev/null
+++ b/src/gas/sincosf.S
@@ -0,0 +1,402 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sincosf function.
+#
+# Prototype:
+#
+# void sincosf(float x, float * sinfx, float * cosfx);
+#
+# Computes sinf(x) and cosf(x).
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0 # for alignment
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3FF921FB54442D18: .quad 0x03FF921FB54442D18 # piby2
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0
+.L__real_411E848000000000: .quad 0x415312d000000000 # 5e6 (label keeps the name of the original 0x0411E848000000000 = 5e5 threshold)
+ .quad 0
+
+.align 32
+.Lcsarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+
+.text
+.align 16
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sincosf)
+#define fname_special _sincosf_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ p_temp2, 0x50 # temporary for get/put bits operation
+.equ p_temp3, 0x60 # temporary for get/put bits operation
+.equ region, 0x70 # pointer to region for amd_remainder_piby2
+.equ r, 0x80 # pointer to r for amd_remainder_piby2
+.equ stack_size, 0xa8
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+
+ xorpd %xmm2,%xmm2
+
+# GET_BITS_DP64(x, ux);
+# convert input to double.
+ cvtss2sd %xmm0,%xmm0
+# get the input value to an integer register.
+ movsd %xmm0,p_temp(%rsp)
+ mov p_temp(%rsp),%rdx # rdx is ux
+
+## if NaN or inf
+ mov $0x07ff0000000000000,%rax
+ mov %rax,%r10
+ and %rdx,%r10
+ cmp %rax,%r10
+ jz .L__sc_naninf
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff,%r10
+ and %rdx,%r10 # r10 is ax
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18,%rax
+ cmp %rax,%r10
+ jg .L__sc_reduce
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x3f20000000000000, %rax
+ cmp %rax, %r10
+ jge .L__sc_notsmallest
+
+# sinf = x, cosf = 1.0
+ movsd .L__real_3ff0000000000000(%rip),%xmm1
+ jmp .L__sc_cleanup
+
+# *s = sin_piby4(x, 0.0);
+# *c = cos_piby4(x, 0.0);
+.L__sc_notsmallest:
+ xor %eax,%eax # region 0
+ mov %r10,%rdx
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+ jmp .L__sc_piby4
+
+.L__sc_reduce:
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+
+# xneg = (ax != ux);
+ cmp %r10,%rdx
+## if (xneg) x = -x;
+ jz .Lpositive
+ subsd %xmm0,%xmm2
+ movsd %xmm2,%xmm0
+
+.align 16
+.Lpositive:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip),%r10
+ jae .Lsincosf_reduce_precise
+
+ movsd %xmm0,%xmm2
+ movsd %xmm0,%xmm4
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # twobypi
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10,%r9
+ shr $52,%r9 # >> EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5,%xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1
+ cvttpd2dq %xmm2,%xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip),%xmm1 # piby2_1tail
+ cvtdq2pd %xmm0,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+
+ mulsd %xmm2,%xmm3 # use piby2_1
+ subsd %xmm3,%xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2,%xmm1 # rtail
+
+ movd %xmm0,%eax
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm3 # piby2_2
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm5 # piby2_2tail
+ movd %xmm0,%rcx
+
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1,%rcx # strip any sign bit
+ shr $53,%rcx # >> EXPSHIFTBITS_DP64 +1
+ sub %rcx,%r9 # expdiff
+
+## if (expdiff > 15)
+ cmp $15,%r9
+ jle .Lexpdiff15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4,%xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2,%xmm5 # npi2 * piby2_2tail
+ subsd %xmm3,%xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4,%xmm1 # t - rhead
+ subsd %xmm3,%xmm1 # -rtail
+ subsd %xmm1,%xmm5 # rtail
+
+# r = rhead - rtail;
+ movsd %xmm4,%xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5,%xmm1
+ subsd %xmm5,%xmm0
+
+# region = npi2 & 3;
+# and $3,%eax
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiff15:
+
+## if the input was close to a pi/2 multiple
+#
+
+ cmp $0x03f2,%rcx # if r small.
+ jge .L__sc_piby4 # use taylor series if not
+ cmp $0x03de,%rcx # if r really small.
+ jle .Lsinsmall # then sin(r) = r
+
+ movsd %xmm0,%xmm2
+ mulsd %xmm2,%xmm2 # x^2
+# use a simple polynomial
+# *s = x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip),%xmm3 #
+ mulsd %xmm0,%xmm3 # * x
+ mulsd %xmm2,%xmm3 # * x^2
+ subsd %xmm3,%xmm0 # xs
+
+# *c = 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x^2
+ subsd %xmm2,%xmm1
+ jmp .L__adjust_region
+
+.Lsinsmall: # then sin(r) = r
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # cos(r) is a 1
+ jmp .L__adjust_region
+
+# perform taylor series to calc sinx, cosx
+# COS
+# x2 = x * x;
+# return (1.0 - 0.5 * x2 + (x2 * x2 *
+# (c1 + x2 * (c2 + x2 * (c3 + x2 * c4)))));
+# SIN
+# zc,zs = (c2 + x2 * (c3 + x2 * c4 ));
+# xs = r + x3 * (sc1 + x2 * zs);
+# x2 = x * x;
+# return (x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+# done with reducing the argument. Now perform the sin/cos calculations.
+.align 16
+.L__sc_piby4:
+# x2 = r * r;
+ movsd .L__real_3fe0000000000000(%rip),%xmm5 # .5
+ movsd %xmm0,%xmm2
+ mulsd %xmm0,%xmm2 # x2
+ shufpd $0,%xmm2,%xmm2 # x2,x2
+ movsd %xmm2,%xmm4
+ mulsd %xmm4,%xmm4 # x4
+ shufpd $0,%xmm4,%xmm4 # x4,x4
+
+# x2m = _mm_set1_pd (x2);
+# zc,zs = (c2 + x2 * (c3 + x2 * c4 ));
+# xs = r + x3 * (sc1 + x2 * zs);
+# xc = t + ( x2 * x2 * (cc1 + x2 * zc));
+ movapd .Lcsarray+0x30(%rip),%xmm1 # c4
+ movapd .Lcsarray+0x10(%rip),%xmm3 # c2
+ mulpd %xmm2,%xmm1 # x2c4
+ mulpd %xmm2,%xmm3 # x2c2
+
+# rc = 0.5 * x2;
+ mulsd %xmm2,%xmm5 #rc
+ mulsd %xmm0,%xmm2 #x3
+
+ addpd .Lcsarray+0x20(%rip),%xmm1 # c3 + x2c4
+ addpd .Lcsarray(%rip),%xmm3 # c1 + x2c2
+ mulpd %xmm4,%xmm1 # x4(c3 + x2c4)
+ addpd %xmm3,%xmm1 # c1 + x2c2 + x4(c3 + x2c4)
+
+# -t = rc-1;
+ subsd .L__real_3ff0000000000000(%rip),%xmm5 # 1.0
+# now we have the poly for sin in the low half, and cos in upper half
+ mulsd %xmm1,%xmm2 # x3(sin poly)
+ shufpd $3,%xmm1,%xmm1 # get cos poly to low half of register
+ mulsd %xmm4,%xmm1 # x4(cos poly)
+
+ addsd %xmm2,%xmm0 # sin = r+...
+ subsd %xmm5,%xmm1 # cos = poly-(-t)
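+
+# The packed evaluation above, written out as scalar C (a sketch of what
+# the SSE2 code computes; s1..s4 / c1..c4 are the .Lcsarray coefficients):
+#
+#   x2  = r * r;                          x4 = x2 * x2;
+#   zs  = s1 + x2*s2 + x4*(s3 + x2*s4);   /* low halves of the packed regs  */
+#   zc  = c1 + x2*c2 + x4*(c3 + x2*c4);   /* high halves of the packed regs */
+#   sin = r + (r * x2) * zs;
+#   cos = (1.0 - 0.5 * x2) + x4 * zc;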
+
+.L__adjust_region: # xmm0 is sin, xmm1 is cos
+# switch (region)
+ mov %eax,%ecx
+ and $1,%eax
+ jz .Lregion02
+# region 1 or 3
+ movsd %xmm0,%xmm2 # swap sin,cos
+ movsd %xmm1,%xmm0 # sin = cos
+ xorpd %xmm1,%xmm1
+ subsd %xmm2,%xmm1 # cos = -sin
+
+.Lregion02:
+ and $2,%ecx
+ jz .Lregion23
+# region 2 or 3
+ movsd %xmm0,%xmm2
+ movsd %xmm1,%xmm3
+ xorpd %xmm0,%xmm0
+ xorpd %xmm1,%xmm1
+ subsd %xmm2,%xmm0 # sin = -sin
+ subsd %xmm3,%xmm1 # cos = -cos
+
+.Lregion23:
+## if (xneg) *s = -*s ;
+ cmp %r10,%rdx
+ jz .L__sc_cleanup
+ movsd %xmm0,%xmm2
+ xorpd %xmm0,%xmm0
+ subsd %xmm2,%xmm0 # sin = -sin
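+
+# Equivalently, in C (a sketch of the quadrant mapping used above; s and c
+# are sin/cos of the reduced argument, xneg means the input was negative):
+#
+#   if (region & 1) { t = s; s = c; c = -t; }   /* quadrants 1,3: swap, negate cos */
+#   if (region & 2) { s = -s; c = -c; }         /* quadrants 2,3                   */
+#   if (xneg)       { s = -s; }                 /* sin is odd, cos is even         */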
+
+.align 16
+.L__sc_cleanup:
+ cvtsd2ss %xmm0,%xmm0 # convert back to floats
+ cvtsd2ss %xmm1,%xmm1
+
+ movss %xmm0,(%rdi) # save the sin
+ movss %xmm1,(%rsi) # save the cos
+
+ add $stack_size,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsincosf_reduce_precise:
+# /* Reduce abs(x) into range [-pi/4,pi/4] */
+# __amd_remainder_piby2(ax, &r, &region);
+
+ mov %rdx,p_temp(%rsp) # save ux for use later
+ mov %r10,p_temp1(%rsp) # save ax for use later
+ mov %rdi,p_temp2(%rsp) # save rdi (sin result pointer) for use later
+ mov %rsi,p_temp3(%rsp) # save rsi (cos result pointer) for use later
+ movd %xmm0,%rdi
+ lea r(%rsp),%rsi
+ lea region(%rsp),%rdx
+ sub $0x040,%rsp
+
+ call __amd_remainder_piby2d2f@PLT
+
+ add $0x040,%rsp
+ mov p_temp(%rsp),%rdx # restore ux for use later
+ mov p_temp1(%rsp),%r10 # restore ax for use later
+ mov p_temp2(%rsp),%rdi # restore rdi (sin result pointer)
+ mov p_temp3(%rsp),%rsi # restore rsi (cos result pointer)
+
+ mov $1,%r8d # for determining region later on
+ movsd r(%rsp),%xmm0 # r
+ mov region(%rsp),%eax # region
+ jmp .L__sc_piby4
+
+.align 16
+.L__sc_naninf:
+ cvtsd2ss %xmm0,%xmm0 # convert back to floats
+ call fname_special # rdi and rsi are ready for the function call
+ add $stack_size, %rsp
+ ret
diff --git a/src/gas/sinf.S b/src/gas/sinf.S
new file mode 100644
index 0000000..c2083ff
--- /dev/null
+++ b/src/gas/sinf.S
@@ -0,0 +1,436 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+#
+# An implementation of the sinf function.
+#
+# Prototype:
+#
+# float sinf(float x);
+#
+# Computes sinf(x).
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 32
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0 # for alignment
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0
+.L__real_411E848000000000: .quad 0x415312d000000000 # 5e6 (label keeps the name of the original 0x0411E848000000000 = 5e5 threshold)
+ .quad 0
+
+.align 32
+.Lcosfarray:
+ .quad 0x0bfe0000000000000 # -0.5 c0
+ .quad 0
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0
+ .quad 0x0bf56c16c16c16c16 # -0.00138889 c2
+ .quad 0
+ .quad 0x03EFA01A01A01A019 # 2.48016e-005 c3
+ .quad 0
+ .quad 0x0be927e4fb7789f5c # -2.75573e-007 c4
+ .quad 0
+
+.align 32
+.Lsinfarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0
+ .quad 0x03f81111111111111 # 0.00833333 s2
+ .quad 0
+ .quad 0x0bf2a01a01a01a01a # -0.000198413 s3
+ .quad 0
+ .quad 0x03ec71de3a556c734 # 2.75573e-006 s4
+ .quad 0
+
+.text
+.align 32
+.p2align 4,,15
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(sinf)
+#define fname_special _sinf_special@PLT
+
+# define local variable storage offsets
+.equ p_temp, 0x30 # temporary for get/put bits operation
+.equ p_temp1, 0x40 # temporary for get/put bits operation
+.equ r, 0x50 # pointer to r for amd_remainder_piby2
+.equ region, 0x60 # pointer to region for amd_remainder_piby2
+.equ stack_size, 0x88
+
+.globl fname
+.type fname,@function
+
+fname:
+ sub $stack_size, %rsp
+ xorpd %xmm2, %xmm2 # zeroed out for later use
+
+## if NaN or inf
+ movd %xmm0, %edx
+ mov $0x07f800000, %eax
+ mov %eax, %r10d
+ and %edx, %r10d
+ cmp %eax, %r10d
+ jz .Lsinf_naninf
+
+# GET_BITS_DP64(x, ux);
+# get the input value to an integer register.
+ cvtss2sd %xmm0, %xmm0 # convert input to double.
+ movsd %xmm0,p_temp(%rsp) # get the input value to an integer register.
+
+ mov p_temp(%rsp), %rdx # rdx is ux
+
+# ax = (ux & ~SIGNBIT_DP64);
+ mov $0x07fffffffffffffff, %r10
+ and %rdx, %r10 # r10 is ax
+ mov $1, %r8d # for determining region later on
+
+## if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ mov $0x03fe921fb54442d18, %rax
+ cmp %rax, %r10
+ jg .Lsinf_reduce
+
+## if (ax < 0x3f80000000000000) /* abs(x) < 2.0^(-7) */
+ mov $0x3f80000000000000, %rax
+ cmp %rax, %r10
+ jge .Lsinf_small
+
+## if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ mov $0x3f20000000000000, %rax
+ cmp %rax, %r10
+ jge .Lsinf_smaller
+
+# sinf = x;
+ jmp .Lsinf_cleanup # done
+
+## else
+
+.Lsinf_smaller:
+# sinf = x - x^3 * 0.1666666666666666666;
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fc5555555555555(%rip), %xmm4 # 0.1666666666666666666
+ mulsd %xmm2, %xmm2 # x^2
+ mulsd %xmm0, %xmm2 # x^3
+ mulsd %xmm4, %xmm2 # x^3 * 0.1666666666666666666
+ subsd %xmm2, %xmm0 # x - x^3 * 0.1666666666666666666
+ jmp .Lsinf_cleanup
+
+.Lsinf_small:
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 # x2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 - do a sinf calculation
+# zs = x + x3((s1 + x2 * s2) + x4(s3 + x2 * s4));
+ movsd .Lsinfarray+0x30(%rip), %xmm1 # s4
+ mulsd %xmm2, %xmm1 # s4x2
+ movsd %xmm2, %xmm4 # move for x4
+ movsd .Lsinfarray+0x10(%rip), %xmm5 # s2
+ mulsd %xmm2, %xmm4 # x4
+ movsd %xmm0, %xmm3 # move for x3
+ mulsd %xmm2, %xmm5 # s2x2
+ mulsd %xmm2, %xmm3 # x3
+ addsd .Lsinfarray+0x20(%rip), %xmm1 # s3+s4x2
+ mulsd %xmm4, %xmm1 # s3x4+s4x6
+ addsd .Lsinfarray(%rip), %xmm5 # s1+s2x2
+ addsd %xmm5, %xmm1 # s1+s2x2+s3x4+s4x6
+ mulsd %xmm3, %xmm1 # x3(s1+s2x2+s3x4+s4x6)
+ addsd %xmm1, %xmm0 # x + x3(s1+s2x2+s3x4+s4x6)
+ jmp .Lsinf_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 32
+.Lsinf_reduce:
+
+# xneg = (ax != ux);
+ cmp %r10, %rdx
+ mov $0, %r11d
+
+## if (xneg) x = -x;
+ jz .L50e5
+ mov $1, %r11d
+ subsd %xmm0, %xmm2
+ movsd %xmm2, %xmm0
+
+.L50e5:
+## if (x < 5.0e5)
+ cmp .L__real_411E848000000000(%rip), %r10
+ jae .Lsinf_reduce_precise
+
+# reduce the argument to be in a range from -pi/4 to +pi/4
+# by subtracting multiples of pi/2
+ movsd %xmm0, %xmm2
+ movsd .L__real_3fe45f306dc9c883(%rip), %xmm3 # twobypi
+ movsd %xmm0, %xmm4
+ movsd .L__real_3fe0000000000000(%rip), %xmm5 # .5
+ mulsd %xmm3, %xmm2
+
+#/* How many pi/2 is x a multiple of? */
+# xexp = ax >> EXPSHIFTBITS_DP64;
+ mov %r10, %r9
+ shr $52, %r9 #>>EXPSHIFTBITS_DP64
+
+# npi2 = (int)(x * twobypi + 0.5);
+ addsd %xmm5, %xmm2 # npi2
+
+ movsd .L__real_3ff921fb54400000(%rip), %xmm3 # piby2_1
+ cvttpd2dq %xmm2, %xmm0 # convert to integer
+ movsd .L__real_3dd0b4611a626331(%rip), %xmm1 # piby2_1tail
+ cvtdq2pd %xmm0, %xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+# rhead = x - npi2 * piby2_1;
+ mulsd %xmm2, %xmm3
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_1tail;
+ mulsd %xmm2, %xmm1
+ movd %xmm0, %eax
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+ movsd %xmm4, %xmm0
+ subsd %xmm1, %xmm0
+
+ movsd .L__real_3dd0b4611a600000(%rip), %xmm3 # piby2_2
+ movsd %xmm0,p_temp(%rsp)
+ movsd .L__real_3ba3198a2e037073(%rip), %xmm5 # piby2_2tail
+ mov p_temp(%rsp), %rcx # rcx is rhead-rtail
+
+# xmm0=r, xmm4=rhead, xmm1=rtail, xmm2=npi2, xmm3=temp for calc, xmm5= temp for calc
+# expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ shl $1, %rcx # strip any sign bit
+ shr $53, %rcx #>> EXPSHIFTBITS_DP64 +1
+ sub %rcx, %r9 #expdiff
+
+## if (expdiff > 15)
+ cmp $15, %r9
+ jle .Lexpdiff15
+
+# /* The remainder is pretty small compared with x, which
+# implies that x is a near multiple of pi/2
+# (x matches the multiple to at least 15 bits) */
+
+# t = rhead;
+ movsd %xmm4, %xmm1
+
+# rtail = npi2 * piby2_2;
+ mulsd %xmm2, %xmm3
+
+# rhead = t - rtail;
+ mulsd %xmm2, %xmm5 # npi2 * piby2_2tail
+ subsd %xmm3, %xmm4 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ subsd %xmm4, %xmm1 # t - rhead
+ subsd %xmm3, %xmm1 # -rtail
+ subsd %xmm1, %xmm5 #rtail
+
+# r = rhead - rtail;
+ movsd %xmm4, %xmm0
+
+#HARSHA
+#xmm1=rtail
+ movsd %xmm5, %xmm1
+ subsd %xmm5, %xmm0
+
+# xmm0=r, xmm4=rhead, xmm1=rtail
+.Lexpdiff15:
+# region = npi2 & 3;
+# No need rr for float case
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+## if the input was close to a pi/2 multiple
+# The original NAG code missed this trick. If the input is very close to n*pi/2
+# after reduction, then cosf(r) is ~1.0 and sinf(r) is ~r to single precision
+# when r is < 2^-13. We already have r at this point, so we can skip the
+# sinf/cosf polynomials.
+
+ cmp $0x03f2, %rcx ## if r small.
+ jge .Lsinf_piby4 # use taylor series if not
+ cmp $0x03de, %rcx ## if r really small.
+ jle .Lr_small # then sinf(r) = r
+
+ movsd %xmm0, %xmm2
+ mulsd %xmm2, %xmm2 #x^2
+
+## if region is 0 or 2 do a sinf calc.
+ and %eax, %r8d
+ jnz .Lcosfregion
+
+# region 0 or 2 do a sinf calculation
+# use a simple polynomial
+# x - x*x*x*0.166666666666666666;
+ movsd .L__real_3fc5555555555555(%rip), %xmm3 #
+ mulsd %xmm0, %xmm3 # * x
+ mulsd %xmm2, %xmm3 # * x^2
+ subsd %xmm3, %xmm0 # xs
+ jmp .Ladjust_region
+
+.align 32
+.Lcosfregion:
+# region 1 or 3 do a cosf calculation
+# use a simple polynomial
+# 1.0 - x*x*0.5;
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # 1.0
+ mulsd .L__real_3fe0000000000000(%rip), %xmm2 # 0.5 *x^2
+ subsd %xmm2, %xmm0 # xc
+ jmp .Ladjust_region
+
+.align 32
+.Lr_small:
+## if region is 1 or 3 do a cosf calc.
+ and %eax, %r8d
+ jz .Ladjust_region
+
+# odd
+ movsd .L__real_3ff0000000000000(%rip), %xmm0 # cosf(r) is a 1
+ jmp .Ladjust_region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsinf_reduce_precise:
+# // Reduce x into range [-pi/4,pi/4]
+# __amd_remainder_piby2d2f(x, &r, ®ion);
+
+ mov %r11,p_temp(%rsp)
+ lea region(%rsp), %rdx
+ lea r(%rsp), %rsi
+ movd %xmm0, %rdi
+ sub $0x20, %rsp
+
+ call __amd_remainder_piby2d2f@PLT
+
+ add $0x20, %rsp
+ mov p_temp(%rsp), %r11
+ mov $1, %r8d # for determining region later on
+ movsd r(%rsp), %xmm1 # r
+ mov region(%rsp), %eax # region
+
+# xmm0 = x, xmm4 = xx, r8d = 1, eax= region
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# perform taylor series to calc sinfx, cosfx
+.Lsinf_piby4:
+# x2 = r * r;
+ movsd %xmm0, %xmm2
+ mulsd %xmm0, %xmm2 #x2
+
+## if region is 0 or 2 do a sinf calc.
+ and %eax, %r8d
+ jnz .Lcosfregion2
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 0 or 2 do a sinf calculation
+# zs = x + x3((s1 + x2 * s2) + x4(s3 + x2 * s4));
+ movsd .Lsinfarray+0x30(%rip), %xmm1 # s4
+ mulsd %xmm2, %xmm1 # s4x2
+ movsd %xmm2, %xmm4 # move for x4
+ mulsd %xmm2, %xmm4 # x4
+ movsd .Lsinfarray+0x10(%rip), %xmm5 # s2
+ mulsd %xmm2, %xmm5 # s2x2
+ movsd %xmm0, %xmm3 # move for x3
+ mulsd %xmm2, %xmm3 # x3
+ addsd .Lsinfarray+0x20(%rip), %xmm1 # s3+s4x2
+ mulsd %xmm4, %xmm1 # s3x4+s4x6
+ addsd .Lsinfarray(%rip), %xmm5 # s1+s2x2
+ addsd %xmm5, %xmm1 # s1+s2x2+s3x4+s4x6
+ mulsd %xmm3, %xmm1 # x3(s1+s2x2+s3x4+s4x6)
+ addsd %xmm1, %xmm0 # x + x3(s1+s2x2+s3x4+s4x6)
+
+ jmp .Ladjust_region
+
+.align 32
+.Lcosfregion2:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# region 1 or 3 - do a cosf calculation
+# zc = 1 - 0.5*x2 + c1*x4 + c2*x6 + c3*x8 + c4*x10, for higher precision
+ movsd .Lcosfarray+0x40(%rip), %xmm1 # c4
+ movsd %xmm2, %xmm4 # move for x4
+ mulsd %xmm2, %xmm1 # c4x2
+ movsd .Lcosfarray+0x20(%rip), %xmm3 # c2
+ mulsd %xmm2, %xmm4 # x4
+ movsd .Lcosfarray(%rip), %xmm0 # c0
+ mulsd %xmm2, %xmm3 # c2x2
+ mulsd %xmm2, %xmm0 # c0x2 (=-0.5x2)
+ addsd .Lcosfarray+0x30(%rip), %xmm1 # c3+c4x2
+ mulsd %xmm4, %xmm1 # c3x4 + c4x6
+ addsd .Lcosfarray+0x10(%rip), %xmm3 # c1+c2x2
+ addsd %xmm3, %xmm1 # c1 + c2x2 + c3x4 + c4x6
+ mulsd %xmm4, %xmm1 # c1x4 + c2x6 + c3x8 + c4x10
+ addsd .L__real_3ff0000000000000(%rip), %xmm0 # 1 - 0.5x2
+ addsd %xmm1, %xmm0 # 1 - 0.5x2 + c1x4 + c2x6 + c3x8 + c4x10
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 32
+.Ladjust_region: # positive or negative
+# switch (region)
+ shr $1, %eax
+ mov %eax, %ecx
+ and %r11d, %eax
+
+ not %ecx
+ not %r11d
+ and %r11d, %ecx
+
+ or %ecx, %eax
+ and $1, %eax
+ jnz .Lsinf_cleanup
+
+## if the original region 0, 1 and arg is negative, then we negate the result.
+## if the original region 2, 3 and arg is positive, then we negate the result.
+ movsd %xmm0, %xmm2
+ xorpd %xmm0, %xmm0
+ subsd %xmm2, %xmm0
+
+.align 32
+.Lsinf_cleanup:
+ cvtsd2ss %xmm0, %xmm0
+ add $stack_size, %rsp
+ ret
+
+.align 32
+.Lsinf_naninf:
+ call fname_special
+ add $stack_size, %rsp
+ ret
+
+
diff --git a/src/gas/trunc.S b/src/gas/trunc.S
new file mode 100644
index 0000000..c29d0fd
--- /dev/null
+++ b/src/gas/trunc.S
@@ -0,0 +1,87 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# trunc.S
+#
+# An implementation of the trunc libm function.
+#
+# The trunc functions round their argument to the integer value, in floating format,
+# nearest to but no larger in magnitude than the argument.
+#
+#
+# Prototype:
+#
+# double trunc(double x);
+#
+
+#
+# Algorithm:
+#
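+# A C sketch of the approach used below (illustrative only; reading the
+# .Error_val path as NaN/overflow handling is an interpretation):
+#
+#   long long i = (long long)x;              /* CVTTSD2SIQ truncates toward zero  */
+#   if (i == 0x8000000000000000LL) {         /* conversion overflowed             */
+#       if (x != x) return x + x;            /* NaN in -> quiet NaN out           */
+#       return x;                            /* |x| too large: already integral   */
+#   }
+#   double r = (double)i;                    /* CVTSI2SDQ                         */
+#   return copysign(r, x);                   /* keep the sign, so trunc(-0.5) == -0.0 */
+#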
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(trunc)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+ MOVAPD %xmm0,%xmm1
+
+#convert double to integer.
+ CVTTSD2SIQ %xmm0,%rax
+ CMP .L__Erro_mask(%rip),%rax
+ jz .Error_val
+#convert integer to double
+ CVTSI2SDQ %rax,%xmm0
+
+ PSRLQ $63,%xmm1
+ PSLLQ $63,%xmm1
+
+ POR %xmm1,%xmm0
+
+
+ ret
+
+.Error_val:
+ MOVAPD %xmm1,%xmm2
+ CMPEQSD %xmm1,%xmm1
+ ADDSD %xmm2,%xmm2
+
+ PAND %xmm1,%xmm0
+ PANDN %xmm2,%xmm1
+ POR %xmm1,%xmm0
+
+
+ ret
+
+.data
+.align 16
+.L__Erro_mask: .quad 0x8000000000000000
+ .quad 0x0
diff --git a/src/gas/truncf.S b/src/gas/truncf.S
new file mode 100644
index 0000000..c73ad8f
--- /dev/null
+++ b/src/gas/truncf.S
@@ -0,0 +1,93 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+# truncf.S
+#
+# An implementation of the truncf libm function.
+#
+#
+# The truncf functions round their argument to the integer value, in floating format,
+# nearest to but no larger in magnitude than the argument.
+#
+#
+# Prototype:
+#
+# float truncf(float x);
+#
+
+#
+# Algorithm:
+#
+
+#include "fn_macros.h"
+#define fname FN_PROTOTYPE(truncf)
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.text
+.align 16
+.p2align 4,,15
+.globl fname
+.type fname,@function
+fname:
+
+
+ MOVAPD %xmm0,%xmm1
+
+# convert float to integer.
+ CVTTSS2SIQ %xmm0,%rax
+
+ CMP .L__Erro_mask(%rip),%rax
+ jz .Error_val
+
+# convert integer to float
+ CVTSI2SSQ %rax,%xmm0
+
+ PSRLD $31,%xmm1
+ PSLLD $31,%xmm1
+
+ POR %xmm1,%xmm0
+
+
+ ret
+
+.Error_val:
+ MOVAPD %xmm1,%xmm2
+ CMPEQSS %xmm1,%xmm1
+ ADDSS %xmm2,%xmm2
+
+ PAND %xmm1,%xmm0
+ PANDN %xmm2,%xmm1
+ POR %xmm1,%xmm0
+
+
+
+
+ ret
+
+.data
+.align 16
+.L__Erro_mask: .quad 0x8000000000000000
+ .quad 0x0
diff --git a/src/gas/v4hcosl.S b/src/gas/v4hcosl.S
new file mode 100644
index 0000000..a3ded17
--- /dev/null
+++ b/src/gas/v4hcosl.S
@@ -0,0 +1,62 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hcosl.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4cos(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 cos values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
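+#
+# Example caller (a sketch only; assumes SSE2 intrinsics and the 16-byte
+# aligned output buffer required above):
+#
+#   #include <emmintrin.h>
+#   void v4cos(__m128d x1, __m128d x2, double *ya);
+#
+#   double in[4] = {0.1, 0.2, 0.3, 0.4};
+#   double out[4] __attribute__((aligned(16)));
+#   v4cos(_mm_loadu_pd(&in[0]), _mm_loadu_pd(&in[2]), out);
+#   /* out[i] now holds cos(in[i]) for i = 0..3 */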
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_cos
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4cos
+ .type v4cos,@function
+v4cos:
+ push %rdi
+ call __vrd4_cos@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
diff --git a/src/gas/v4helpl.S b/src/gas/v4helpl.S
new file mode 100644
index 0000000..02fa080
--- /dev/null
+++ b/src/gas/v4helpl.S
@@ -0,0 +1,83 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4help.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4exp(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 exp values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# %xmm0 - __m128d x1
+# %xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_exp
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4exp
+ .type v4exp,@function
+v4exp:
+ push %rdi
+ call __vrd4_exp@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+
+# %xmm0,%rcx - __m128 x1
+# %xmm1,%rdx - __m128 x2
+# r8 - float *ya
+
+.extern __vrs8_expf
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8expf
+ .type v8expf,@function
+v8expf:
+ push %rdi
+ call __vrs8_expf@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
diff --git a/src/gas/v4hfrcpal.S b/src/gas/v4hfrcpal.S
new file mode 100644
index 0000000..d648d9d
--- /dev/null
+++ b/src/gas/v4hfrcpal.S
@@ -0,0 +1,63 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hfrcpal.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4frcpa(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 frcpa values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_frcpa
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4frcpa
+ .type v4frcpa,@function
+v4frcpa:
+ push %rdi
+ call __vrd4_frcpa@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
diff --git a/src/gas/v4hlog10l.S b/src/gas/v4hlog10l.S
new file mode 100644
index 0000000..0cdb6ba
--- /dev/null
+++ b/src/gas/v4hlog10l.S
@@ -0,0 +1,81 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hlog10l.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4log10(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 log10 values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_log10
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4log10
+ .type v4log10,@function
+v4log10:
+ push %rdi
+ call __vrd4_log10@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+# xmm0 - __m128 x1
+# xmm1 - __m128 x2
+# rdi - single *ya
+
+.extern __vrs8_log10f
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8log10f
+ .type v8log10f,@function
+v8log10f:
+ push %rdi
+ call __vrs8_log10f@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+
+ ret
diff --git a/src/gas/v4hlog2l.S b/src/gas/v4hlog2l.S
new file mode 100644
index 0000000..1a8c33e
--- /dev/null
+++ b/src/gas/v4hlog2l.S
@@ -0,0 +1,81 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hlog2l.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4log2(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 log2 values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_log2
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4log2
+ .type v4log2,@function
+v4log2:
+ push %rdi
+ call __vrd4_log2@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+# xmm0 - __m128 x1
+# xmm1 - __m128 x2
+# rdi - single *ya
+
+.extern __vrs8_log2f
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8log2f
+ .type v8log2f,@function
+v8log2f:
+ push %rdi
+ call __vrs8_log2f@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+
+ ret
diff --git a/src/gas/v4hlogl.S b/src/gas/v4hlogl.S
new file mode 100644
index 0000000..512648d
--- /dev/null
+++ b/src/gas/v4hlogl.S
@@ -0,0 +1,84 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hlogl.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4log(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 log values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_log
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4log
+ .type v4log,@function
+v4log:
+ push %rdi
+ call __vrd4_log@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
+
+
+# xmm0 - __m128 x1
+# xmm1 - __m128 x2
+# rdi - float *ya
+
+#.extern __vrs8_logf
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v8logf
+ .type v8logf,@function
+v8logf:
+ push %rdi
+ call __vrs8_logf@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+
+ ret
+
diff --git a/src/gas/v4hsinl.S b/src/gas/v4hsinl.S
new file mode 100644
index 0000000..97bfa2d
--- /dev/null
+++ b/src/gas/v4hsinl.S
@@ -0,0 +1,62 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# v4hsinl.s
+#
+# Helper routines for testing the x4 double and x8 single vector
+# math functions.
+#
+# Prototype:
+#
+# void v4sin(__m128d x1, __m128d x2, double * ya);
+#
+# Computes 4 sin values simultaneously and returns them
+# in the v4a array.
+# Assumes that ya is 16 byte aligned.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+# rdi - double *ya
+
+.extern __vrd4_sin
+ .text
+ .align 16
+ .p2align 4,,15
+.globl v4sin
+ .type v4sin,@function
+v4sin:
+ push %rdi
+ call __vrd4_sin@PLT
+ pop %rdi
+ movdqa %xmm0,(%rdi)
+ movdqa %xmm1,16(%rdi)
+ ret
diff --git a/src/gas/vrd2cos.S b/src/gas/vrd2cos.S
new file mode 100644
index 0000000..d12a156
--- /dev/null
+++ b/src/gas/vrd2cos.S
@@ -0,0 +1,756 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# A vector implementation of the libm cos function.
+#
+# Prototype:
+#
+# __m128d __vrd2_cos(__m128d x);
+#
+# Computes Cosine of x
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+.text
+.align 16
+.p2align 4,,15
+
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+.equ p_temp2,0x20 # temporary for get/put bits operation
+.equ p_xmm6, 0x30 # temporary for get/put bits operation
+.equ p_xmm7, 0x40 # temporary for get/put bits operation
+.equ p_xmm8, 0x50 # temporary for get/put bits operation
+.equ p_xmm9, 0x60 # temporary for get/put bits operation
+.equ p_xmm10,0x70 # temporary for get/put bits operation
+.equ p_xmm11,0x80 # temporary for get/put bits operation
+.equ p_xmm12,0x90 # temporary for get/put bits operation
+.equ p_xmm13,0x0A0 # temporary for get/put bits operation
+.equ p_xmm14,0x0B0 # temporary for get/put bits operation
+.equ p_xmm15,0x0C0 # temporary for get/put bits operation
+.equ r, 0x0D0 # pointer to r for remainder_piby2
+.equ rr, 0x0E0 # pointer to rr for remainder_piby2
+.equ region, 0x0F0 # pointer to region for remainder_piby2
+.equ p_original,0x100 # original x
+.equ p_mask, 0x110 # mask
+.equ p_sign, 0x120 # sign mask
+
+.globl __vrd2_cos
+ .type __vrd2_cos,@function
+__vrd2_cos:
+ sub $0x138,%rsp
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+movdqa %xmm0, p_original(%rsp)
+andpd .L__real_7fffffffffffffff(%rip),%xmm0
+movdqa %xmm0, p_temp(%rsp)
+mov $0x3FE921FB54442D18,%rdx #piby4
+mov $0x411E848000000000,%r10 #5e5
+movapd .L__real_v2p__27(%rip),%xmm4 #for later use
+
+movapd %xmm0,%xmm2 #x
+movapd %xmm0,%xmm4 #x
+
+mov p_temp(%rsp),%rax #rax = lower arg
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movapd .L__real_3fe0000000000000(%rip),%xmm5 #0.5 for later use
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax #is lower arg >= 5e5
+ jae .Llower_or_both_arg_gt_5e5
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lupper_arg_gt_5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lboth_arg_lt_than_5e5:
+# xmm0, xmm2, xmm4 = x, xmm5 = 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ movapd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3=piby2_1
+ addpd %xmm5,%xmm2 # xmm2 = npi2 = x*twobypi+0.5
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1=piby2_2
+ movapd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6=piby2_2tail
+ cvttpd2dq %xmm2,%xmm0 # xmm0=convert npi2 to ints
+ cvtdq2pd %xmm0,%xmm2 # xmm2=and back to double.
+
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm3 # npi2 * piby2_1
+ subpd %xmm3,%xmm4 # xmm4 = rhead=x-npi2*piby2_1
+
+#t = rhead;
+ movapd %xmm4,%xmm5 # xmm5=t=rhead
+
+#rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm1 # xmm1= npi2*piby2_2
+
+#rhead = t - rtail;
+ subpd %xmm1,%xmm4 # xmm4= rhead = t-rtail
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subpd %xmm4,%xmm5 # t-rhead
+ subpd %xmm5,%xmm1 # rtail-(t - rhead)
+ addpd %xmm6,%xmm1 # rtail=npi2*piby2_2+(rtail-(t-rhead))
+
+#r = rhead - rtail
+#rr=(rhead-r) -rtail
+#Sign
+#Region
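+#
+# For cos(x) with x ~ npi2*pi/2 + r, the quadrant npi2 mod 4 decides the result:
+#   0 -> cos(r), 1 -> -sin(r), 2 -> -cos(r), 3 -> sin(r)
+# so "region" below is npi2 & 1 (selects the sin or cos polynomial) and the
+# sign mask is built from bit 1 of (npi2 + 1).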
+ movdqa %xmm0,%xmm5 # Sign
+ movdqa %xmm0,%xmm6 # Region
+ movdqa %xmm4,%xmm0 # rhead (handle xmm0 retype)
+
+ paddd .L__reald_one_one(%rip),%xmm6 # Sign
+ pand .L__reald_two_two(%rip),%xmm6
+ punpckldq %xmm6,%xmm6
+ psllq $62,%xmm6 # xmm6 is in Int format
+
+ subpd %xmm1,%xmm0 # rhead - rtail
+ pand .L__reald_one_one(%rip),%xmm5 # Odd/Even region for Sin/Cos
+ mov .L__reald_one_zero(%rip),%r9 # Compare value for sincos
+ subpd %xmm0,%xmm4 # rr=rhead-r
+ movd %xmm5,%r8 # Region
+ movapd %xmm0,%xmm2 # Move for x2
+ movdqa %xmm6,%xmm6 # handle xmm6 retype
+ mulpd %xmm0,%xmm2 # x2
+ subpd %xmm1,%xmm4 # rr=(rhead-r) -rtail
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+
+.align 16
+.L__vrd2_cos_approximate:
+ cmp $0,%r8
+ jnz .Lvrd2_not_cos_piby4
+
+.Lvrd2_cos_piby4:
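+# Both lanes are in an even quadrant, so both need +/-cos(r).  A sketch of the
+# approximation the scheduled code below computes, for |r| <= pi/4:
+#   cos(r) ~= 1 - 0.5*r^2 + r^4*(c1 + c2*r^2 + ... + c6*r^10) - r*rr
+# with the 1 - 0.5*r^2 part split so its rounding error is re-added at the end.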
+ mulpd %xmm0,%xmm4 # x*xx
+ movdqa .L__real_3fe0000000000000(%rip),%xmm5 # 0.5 (handle xmm5 retype)
+ movapd .Lcosarray+0x50(%rip),%xmm1 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm0 # c3
+ mulpd %xmm2,%xmm5 # r = 0.5 *x2
+ movapd %xmm2,%xmm3 # copy of x2 for x4
+ movapd %xmm4,p_temp(%rsp) # store x*xx
+ mulpd %xmm2,%xmm1 # c6*x2
+ mulpd %xmm2,%xmm0 # c3*x2
+ subpd .L__real_3ff0000000000000(%rip),%xmm5 # -t=r-1.0
+ mulpd %xmm2,%xmm3 # x4
+ addpd .Lcosarray+0x40(%rip),%xmm1 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm0 # c2+x2C3
+ addpd .L__real_3ff0000000000000(%rip),%xmm5 # 1 + (-t)
+ mulpd %xmm2,%xmm3 # x6
+ mulpd %xmm2,%xmm1 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm0 # x2(c2+x2C3)
+ movapd %xmm2,%xmm4 # copy of x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm4 # r = 0.5 *x2
+ addpd .Lcosarray+0x30(%rip),%xmm1 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm0 # c1+x2(c2+x2C3)
+ mulpd %xmm2,%xmm2 # x4
+ subpd %xmm4,%xmm5 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm1 # x6(c4 + x2(c5+x2c6))
+ addpd %xmm1,%xmm0 # zc
+ subpd .L__real_3ff0000000000000(%rip),%xmm4 # -t=r-1.0
+ subpd p_temp(%rsp),%xmm5 # ((1 + (-t)) - r) - x*xx
+ mulpd %xmm2,%xmm0 # x4 * zc
+ addpd %xmm5,%xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx)
+ subpd %xmm4,%xmm0 # result - (-t)
+ xorpd %xmm6,%xmm0 # xor with sign
+ jmp .L__vrd2_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lvrd2_not_cos_piby4:
+ cmp $1,%r8
+ jnz .Lvrd2_not_cos_sin_piby4
+
+.Lvrd2_cos_sin_piby4:
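+# Mixed case: one lane is in an odd quadrant (+/-sin(r)) and the other in an
+# even quadrant (+/-cos(r)).  .Lsincosarray interleaves the sin and cos
+# coefficient sets so one packed polynomial evaluation serves both lanes;
+# the lanes are then finished separately with scalar ops below.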
+
+ movdqa %xmm6,p_temp1(%rsp) # Store Sign
+ movapd %xmm4,p_temp(%rsp) # Store rr
+
+ movapd .Lsincosarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3 (handle xmm5 retype)
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+ addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move x2 for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lsincosarray+0x30(%rip),%xmm3 # s4+x2(s5+x2s6)
+
+ movhlps %xmm1,%xmm1 # move high x4 for cos
+ mulpd %xmm4,%xmm3 # x6(s4+x2(s5+x2s6))
+ addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ movapd %xmm2,%xmm4 # move low x2 for x3
+ mulsd %xmm0,%xmm4 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+
+ addpd %xmm3,%xmm5 # z
+ movhlps %xmm2,%xmm6 # move high r for cos
+ movhlps %xmm5,%xmm3 # xmm5 = sin
+ # xmm3 = cos
+
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx
+
+ mulsd %xmm4,%xmm5 # sin *x3
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0
+ mulsd %xmm1,%xmm3 # cos *x4
+ subsd %xmm6,%xmm4 # t=1.0-r
+
+ movhlps %xmm0,%xmm1
+ subsd %xmm2,%xmm5 # sin - 0.5 * x2 *xx
+
+ mulsd p_temp+8(%rsp),%xmm1 # x * xx
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1
+ subsd %xmm4,%xmm2 # 1 - t
+ addsd p_temp(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm1,%xmm2 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm5,%xmm0 # sin + x
+ addsd %xmm2,%xmm3 # cos+((1-t)-r - x*xx)
+ addsd %xmm4,%xmm3 # cos+t
+
+ movapd p_temp1(%rsp),%xmm5 # load sign
+ movlhps %xmm3,%xmm0
+ xorpd %xmm5,%xmm0
+ jmp .L__vrd2_cos_cleanup
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lvrd2_not_cos_sin_piby4:
+ cmp %r9,%r8
+ jnz .Lvrd2_sin_piby4
+
+.Lvrd2_sin_cos_piby4:
+
+	movapd	 %xmm4,p_temp(%rsp)		# move rr to memory
+	movapd	 %xmm0,p_temp1(%rsp)		# move r to memory
+ movapd %xmm6,p_sign(%rsp)
+
+ movapd .Lcossinarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm5 # s3
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+
+ addpd .Lcossinarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lcossinarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lcossinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+
+ movhlps %xmm0,%xmm0 # high of x for x3
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+ addpd .Lcossinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+
+ movhlps %xmm2,%xmm4 # high of x2 for x3
+
+ addpd %xmm5,%xmm3 # z
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+ mulsd %xmm0,%xmm4 # x3 #
+ movhlps %xmm3,%xmm5 # xmm5 = sin
+ # xmm3 = cos
+
+ mulsd %xmm4,%xmm5 # sin*x3 #
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0 #
+ mulsd %xmm1,%xmm3 # cos*x4 #
+
+ subsd %xmm2,%xmm4 # t=1.0-r #
+
+ movhlps %xmm2,%xmm6 # move 0.5 * x2 for 0.5 * x2 * xx #
+ mulsd p_temp+8(%rsp),%xmm6 # 0.5 * x2 * xx #
+ subsd %xmm6,%xmm5 # sin - 0.5 * x2 *xx #
+ addsd p_temp+8(%rsp),%xmm5 # sin+xx #
+
+ movlpd p_temp1(%rsp),%xmm6 # x
+ mulsd p_temp(%rsp),%xmm6 # x *xx #
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1 #
+ subsd %xmm4,%xmm1 # 1 -t #
+ addsd %xmm5,%xmm0 # sin+x #
+ subsd %xmm2,%xmm1 # (1-t) - r #
+ subsd %xmm6,%xmm1 # ((1 + (-t)) - r) - x*xx #
+ addsd %xmm1,%xmm3 # cos+((1 + (-t)) - r) - x*xx #
+ addsd %xmm4,%xmm3 # cos+t #
+
+ movapd p_sign(%rsp),%xmm2 # load sign
+ movlhps %xmm0,%xmm3
+ movapd %xmm3,%xmm0
+ xorpd %xmm2,%xmm0
+ jmp .L__vrd2_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lvrd2_sin_piby4:
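+# Both lanes are in an odd quadrant, so both need +/-sin(r).  A sketch of the
+# approximation computed below, for |r| <= pi/4:
+#   sin(r) ~= r + r^3*(s1 + s2*r^2 + ... + s6*r^10) + rr - 0.5*r^2*rr
+# where rr is the low part of the reduced argument.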
+ movapd .Lsinarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movapd .Lsinarray+0x20(%rip),%xmm5 # s3
+ movapd %xmm4,p_temp(%rsp) # store xx
+ movapd %xmm2,%xmm1 # move for x4
+ mulpd %xmm2,%xmm1 # x4
+ movapd %xmm0,p_temp1(%rsp) # store x
+
+ mulpd %xmm2,%xmm5 # x2s3
+ movapd %xmm0,%xmm4 # move for x3
+ addpd .Lsinarray+0x40(%rip),%xmm3 # s5+x2s6
+ mulpd %xmm2,%xmm1 # x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm4 # x3
+ addpd .Lsinarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lsinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+
+ movapd p_temp(%rsp),%xmm0 # load xx
+ mulpd %xmm1,%xmm3 # x6(s4 + x2(s5+x2s6))
+ addpd .Lsinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd %xmm0,%xmm2 # 0.5 * x2 *xx
+ addpd %xmm5,%xmm3 # zs
+ mulpd %xmm3,%xmm4 # *x3
+ subpd %xmm2,%xmm4 # x3*zs - 0.5 * x2 *xx
+ addpd %xmm4,%xmm0 # +xx
+ addpd p_temp1(%rsp),%xmm0 # +x
+
+ xorpd %xmm6,%xmm0 # xor sign
+ jmp .L__vrd2_cos_cleanup
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Llower_or_both_arg_gt_5e5:
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm4,%xmm4
+
+# Work on Upper arg
+# xmm0, xmm2, xmm4 = x, xmm5 = 0.5
+# The lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+
+#If upper Arg is <=piby4
+ cmp %rdx,%rcx # is upper arg > piby4
+ ja 0f
+
+ mov $0,%ecx # region = 0
+ mov %ecx,region+4(%rsp) # store upper region
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ xorpd %xmm4,%xmm4 # rr = 0
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+ jmp .Lcheck_lower_arg
+
+.align 16
+0:
+#If upper Arg is > piby4
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1
+	cvttsd2si	%xmm2,%ecx		# ecx = npi2 trunc to int
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm4,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm4 # rr=rhead-r
+ subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
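+#__amd_remainder_piby2 appears to take the argument in xmm0 and pointers to
+#r, rr and region in rdi, rsi and rdx (see the lea/movlpd setup below); it
+#performs the full-precision pi/2 reduction needed for these large arguments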
+.align 16
+.Lcheck_lower_arg:
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd2_cos_lower_naninf
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd2_cos_lower_naninf:
+	mov	p_original(%rsp),%rax		# lower arg is nan/inf
+
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd2_cos_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lupper_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+	movlhps	%xmm0,%xmm0			#Not needed since we want to work on the lower arg, but done just to be safe, to avoid exceptions due to nan/inf, and to mirror the lower_arg_gt_5e5 case
+ movlhps %xmm2,%xmm2
+ movlhps %xmm4,%xmm4
+
+
+# Work on Lower arg
+# xmm0, xmm2, xmm4 = x, xmm5 = 0.5
+# The upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+#If lower Arg is <=piby4
+	cmp	%rdx,%rax			# is lower arg > piby4
+ ja 0f
+
+ mov $0,%eax # region = 0
+	mov	%eax,region(%rsp)		# store lower region
+	movlpd	%xmm0,r(%rsp)			# store lower r
+	xorpd	%xmm4,%xmm4			# rr = 0
+	movlpd	%xmm4,rr(%rsp)			# store lower rr
+ jmp .Lcheck_upper_arg
+
+.align 16
+0:
+#If lower Arg is > piby4
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1
+	cvttsd2si	%xmm2,%eax		# eax = npi2 trunc to int
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm4,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm4 # rr=rhead-r
+ subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm4,rr(%rsp) # store lower rr
+
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_upper_arg:
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd2_cos_upper_naninf
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd2_cos_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd2_cos_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+
+# movhlps %xmm0, %xmm6 #Save upper fp arg for remainder_piby2 call
+ movhpd %xmm0, p_temp1(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd2_cos_lower_naninf_of_both_gt_5e5
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ call __amd_remainder_piby2@PLT
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd2_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd2_cos_upper_naninf_of_both_gt_5e5
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd p_temp1(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd2_cos_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+# movd %xmm6,%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+0:
+.L__vrd2_cos_reconstruct:
+#Construct xmm0=x, xmm2 =x2, xmm4=xx, r8=region, xmm6=sign
+ movapd r(%rsp),%xmm0 #x
+ movapd %xmm0,%xmm2 #move for x2
+ mulpd %xmm2,%xmm2 #x2
+
+ movapd rr(%rsp),%xmm4 #xx
+
+ mov region(%rsp),%r8
+ mov .L__reald_one_zero(%rip),%r9 #compare value for sincos path
+ mov %r8,%r10
+ and .L__reald_one_one(%rip),%r8 #odd/even region for sin/cos
+ add .L__reald_one_one(%rip),%r10
+ and .L__reald_two_two(%rip),%r10
+ mov %r10,%r11
+ and .L__reald_two_zero(%rip),%r11 #mask out the lower sign bit leaving the upper sign bit
+	shl	$62,%r10			#move the lower sign bit (bit 1) up to bit 63
+	shl	$30,%r11			#move the upper sign bit (bit 33) up to bit 63
+ mov %r10,p_temp(%rsp) #write out lower sign bit
+ mov %r11,p_temp+8(%rsp) #write out upper sign bit
+ movapd p_temp(%rsp),%xmm6 #write out both sign bits to xmm6
+
+ jmp .L__vrd2_cos_approximate
+
+#ENDMAIN
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd2_cos_cleanup:
+ add $0x138,%rsp
+ ret
diff --git a/src/gas/vrd2exp.S b/src/gas/vrd2exp.S
new file mode 100644
index 0000000..b87763f
--- /dev/null
+++ b/src/gas/vrd2exp.S
@@ -0,0 +1,372 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2exp.S
+#
+# A vector implementation of the exp libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_exp(__m128d x);
+#
+# Computes e raised to the x power.
+# Does not perform error checking. Denormal results are truncated to 0.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+.equ stack_size,0x28
+
+
+
+
+.globl __vrd2_exp
+ .type __vrd2_exp,@function
+__vrd2_exp:
+ sub $stack_size,%rsp
+
+
+
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movapd %xmm0,p_temp(%rsp)
+ movapd .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+ maxpd .L__real_C0F0000000000000(%rip),%xmm0 # protect against very large negative, non-infinite numbers
+ mulpd %xmm0,%xmm3
+
+# save x for later.
+ minpd .L__real_40F0000000000000(%rip),%xmm3 # protect against very large, non-infinite numbers
+
+# /* Set n = nearest integer to r */
+ cvtpd2dq %xmm3,%xmm4
+ lea .L__two_to_jby32_lead_table(%rip),%rdi
+ lea .L__two_to_jby32_trail_table(%rip),%rsi
+ cvtdq2pd %xmm4,%xmm1
+
+ # r1 = x - n * logbaseof2_by_32_lead;
+ movapd .L__real_log2_by_32_lead(%rip),%xmm2 #
+ mulpd %xmm1,%xmm2 #
+ movq %xmm4,p_temp1(%rsp)
+ subpd %xmm2,%xmm0 # r1 in xmm0,
+
+# r2 = - n * logbaseof2_by_32_trail;
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm1 # r2 in xmm1
+# j = n & 0x0000001f;
+ mov $0x01f,%r9
+ mov %r9,%r8
+ mov p_temp1(%rsp),%ecx
+ and %ecx,%r9d
+
+ mov p_temp1+4(%rsp),%edx
+ and %edx,%r8d
+ movapd %xmm0,%xmm2
+# f1 = two_to_jby32_lead_table[j];
+# f2 = two_to_jby32_trail_table[j];
+
+# *m = (n - j) / 32;
+ sub %r9d,%ecx
+ sar $5,%ecx #m
+ sub %r8d,%edx
+ sar $5,%edx
+
+
+ addpd %xmm1,%xmm2 #r = r1 + r2
+
+# Step 2. Compute the polynomial.
+# q = r1 + (r2 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 ))))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+ movapd %xmm2,%xmm1
+ movapd .L__real_3f56c1728d739765(%rip),%xmm3 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm0 # 1/6
+# deal with infinite results
+ mov $1024,%rax
+ movsx %ecx,%rcx
+ cmp %rax,%rcx
+
+ mulpd %xmm2,%xmm3 # *x
+ mulpd %xmm2,%xmm0 # *x
+ mulpd %xmm2,%xmm1 # x*x
+ movapd %xmm1,%xmm4
+
+ cmovg %rax,%rcx ## if infinite, then set rcx to multiply
+ # by infinity
+ movsx %edx,%rdx
+ cmp %rax,%rdx
+
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm3 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm0 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm2,%xmm3 # *x
+
+	cmovg	%rax,%rdx			## if infinite, then set rdx to multiply
+ # by infinity
+# deal with denormal results
+ xor %rax,%rax
+ add $1023,%rcx # add bias
+
+ mulpd %xmm1,%xmm0 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm3 # + 1/24
+ addpd %xmm2,%xmm0 # + x
+ mulpd %xmm4,%xmm3 # *x^4
+# check for infinity or nan
+ movapd p_temp(%rsp),%xmm2
+ cmovs %rax,%rcx ## if denormal, then multiply by 0
+ shl $52,%rcx # build 2^n
+
+ addpd %xmm3,%xmm0 # q = final sum
+
+# *z2 = f2 + ((f1 + f2) * q);
+ movlpd (%rsi,%r9,8),%xmm5 # f2
+ movlpd (%rsi,%r8,8),%xmm4 # f2
+ addsd (%rdi,%r9,8),%xmm5 # f1 + f2
+
+ addsd (%rdi,%r8,8),%xmm4 # f1 + f2
+ shufpd $0,%xmm4,%xmm5
+
+
+ mulpd %xmm5,%xmm0
+ add $1023,%rdx # add bias
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ addpd %xmm5,%xmm0 #z = z1 + z2
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# r = scaleDouble_1(z, n);
+
+#;;; the following code moved to improve scheduling
+# deal with infinite results
+# mov $1024,%rax
+# movsxd %ecx,%rcx
+# cmp %rax,%rcx
+# cmovg %rax,%rcx ; if infinite, then set rcx to multiply
+ # by infinity
+# movsxd %edx,%rdx
+# cmp %rax,%rdx
+# cmovg %rax,%rdx ; if infinite, then set rcx to multiply
+ # by infinity
+
+# deal with denormal results
+# xor %rax,%rax
+# add $1023,%rcx ; add bias
+# shl $52,%rcx ; build 2^n
+
+# add $1023,%rdx ; add bias
+ shl $52,%rdx # build 2^n
+
+# check for infinity or nan
+# movapd p_temp(%rsp),%xmm2
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ mov %rcx,p_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p_temp1+8(%rsp) # get 2^n to memory
+ movmskpd %xmm2,%r8d
+ test $3,%r8d
+
+# Step 3. Reconstitute.
+
+ mulpd p_temp1(%rsp),%xmm0 # result*= 2^n
+
+# We would like to avoid a branch, and could use cmp's and and's to
+# eliminate it, but that adds cycles to the normal path just to handle
+# cases that should be exceptional.  Using this branch together with the
+# check above gives faster code for the normal cases.
+ jnz .L__exp_naninf
+
+#
+#
+.L__final_check:
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+# check the first number
+ test $1,%r8d
+ jz .L__check2
+
+ mov p_temp(%rsp),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan1 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r1 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r1
+
+#NaN
+.L__enan1:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r1:
+ movd %rdx,%xmm2
+ shufpd $2,%xmm0,%xmm2
+ movsd %xmm2,%xmm0
+# check the second number
+.L__check2:
+ test $2,%r8d
+ jz .L__final_check
+ mov p_temp+8(%rsp),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan2 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r2 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r2
+
+#NaN
+.L__enan2:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r2:
+ movd %rdx,%xmm2
+ shufpd $0,%xmm2,%xmm0
+ jmp .L__final_check
+
+ .data
+ .align 16
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_4040000000000000: .quad 0x04040000000000000 # 32
+ .quad 0x04040000000000000
+.L__real_40F0000000000000:	.quad 0x040F0000000000000	# 65536, to protect against really large numbers
+ .quad 0x040F0000000000000
+.L__real_C0F0000000000000: .quad 0x0C0F0000000000000 # -65536, to protect against really large negative numbers
+ .quad 0x0C0F0000000000000
+.L__real_3FA0000000000000: .quad 0x03FA0000000000000 # 1/32
+ .quad 0x03FA0000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_infinity: .quad 0x07ff0000000000000 #
+ .quad 0x07ff0000000000000 # for alignment
+.L__real_ninfinity: .quad 0x0fff0000000000000 #
+ .quad 0x0fff0000000000000 # for alignment
+.L__real_thirtytwo_by_log2: .quad 0x040471547652b82fe # thirtytwo_by_log2
+ .quad 0x040471547652b82fe
+.L__real_log2_by_32_lead: .quad 0x03f962e42fe000000 # log2_by_32_lead
+ .quad 0x03f962e42fe000000
+.L__real_log2_by_32_tail: .quad 0x0Bdcf473de6af278e # -log2_by_32_tail
+ .quad 0x0Bdcf473de6af278e
+.L__real_3f56c1728d739765: .quad 0x03f56c1728d739765 # 1.38889490863777199667e-03
+ .quad 0x03f56c1728d739765
+.L__real_3F811115B7AA905E: .quad 0x03F811115B7AA905E # 8.33336798434219616221e-03
+ .quad 0x03F811115B7AA905E
+.L__real_3FA5555555545D4E: .quad 0x03FA5555555545D4E # 4.16666666662260795726e-02
+ .quad 0x03FA5555555545D4E
+.L__real_3FC5555555548F7C: .quad 0x03FC5555555548F7C # 1.66666666665260878863e-01
+ .quad 0x03FC5555555548F7C
+
+
+.L__two_to_jby32_lead_table:
+ .quad 0x03ff0000000000000 # 1
+ .quad 0x03ff059b0d0000000 # 1.0219
+ .quad 0x03ff0b55860000000 # 1.04427
+ .quad 0x03ff11301d0000000 # 1.06714
+ .quad 0x03ff172b830000000 # 1.09051
+ .quad 0x03ff1d48730000000 # 1.11439
+ .quad 0x03ff2387a60000000 # 1.13879
+ .quad 0x03ff29e9df0000000 # 1.16372
+ .quad 0x03ff306fe00000000 # 1.18921
+ .quad 0x03ff371a730000000 # 1.21525
+ .quad 0x03ff3dea640000000 # 1.24186
+ .quad 0x03ff44e0860000000 # 1.26905
+ .quad 0x03ff4bfdad0000000 # 1.29684
+ .quad 0x03ff5342b50000000 # 1.32524
+ .quad 0x03ff5ab07d0000000 # 1.35426
+ .quad 0x03ff6247eb0000000 # 1.38391
+ .quad 0x03ff6a09e60000000 # 1.41421
+ .quad 0x03ff71f75e0000000 # 1.44518
+ .quad 0x03ff7a11470000000 # 1.47683
+ .quad 0x03ff8258990000000 # 1.50916
+ .quad 0x03ff8ace540000000 # 1.54221
+ .quad 0x03ff93737b0000000 # 1.57598
+ .quad 0x03ff9c49180000000 # 1.61049
+ .quad 0x03ffa5503b0000000 # 1.64576
+ .quad 0x03ffae89f90000000 # 1.68179
+ .quad 0x03ffb7f76f0000000 # 1.71862
+ .quad 0x03ffc199bd0000000 # 1.75625
+ .quad 0x03ffcb720d0000000 # 1.79471
+ .quad 0x03ffd5818d0000000 # 1.83401
+ .quad 0x03ffdfc9730000000 # 1.87417
+ .quad 0x03ffea4afa0000000 # 1.91521
+ .quad 0x03fff507650000000 # 1.95714
+ .quad 0 # for alignment
+.L__two_to_jby32_trail_table:
+ .quad 0x00000000000000000 # 0
+ .quad 0x03e48ac2ba1d73e2a # 1.1489e-008
+ .quad 0x03e69f3121ec53172 # 4.83347e-008
+ .quad 0x03df25b50a4ebbf1b # 2.67125e-010
+ .quad 0x03e68faa2f5b9bef9 # 4.65271e-008
+ .quad 0x03e368b9aa7805b80 # 5.24924e-009
+ .quad 0x03e6ceac470cd83f6 # 5.38622e-008
+ .quad 0x03e547f7b84b09745 # 1.90902e-008
+ .quad 0x03e64636e2a5bd1ab # 3.79764e-008
+ .quad 0x03e5ceaa72a9c5154 # 2.69307e-008
+ .quad 0x03e682468446b6824 # 4.49684e-008
+ .quad 0x03e18624b40c4dbd0 # 1.41933e-009
+ .quad 0x03e54d8a89c750e5e # 1.94147e-008
+ .quad 0x03e5a753e077c2a0f # 2.46409e-008
+ .quad 0x03e6a90a852b19260 # 4.94813e-008
+ .quad 0x03e0d2ac258f87d03 # 8.48872e-010
+ .quad 0x03e59fcef32422cbf # 2.42032e-008
+ .quad 0x03e61d8bee7ba46e2 # 3.3242e-008
+ .quad 0x03e4f580c36bea881 # 1.45957e-008
+ .quad 0x03e62999c25159f11 # 3.46453e-008
+ .quad 0x03e415506dadd3e2a # 8.0709e-009
+ .quad 0x03e29b8bc9e8a0388 # 2.99439e-009
+ .quad 0x03e451f8480e3e236 # 9.83622e-009
+ .quad 0x03e41f12ae45a1224 # 8.35492e-009
+ .quad 0x03e62b5a75abd0e6a # 3.48493e-008
+ .quad 0x03e47daf237553d84 # 1.11085e-008
+ .quad 0x03e6b0aa538444196 # 5.03689e-008
+ .quad 0x03e69df20d22a0798 # 4.81896e-008
+ .quad 0x03e69f7490e4bb40b # 4.83654e-008
+ .quad 0x03e4bdcdaf5cb4656 # 1.29746e-008
+ .quad 0x03e452486cc2c7b9d # 9.84533e-009
+ .quad 0x03e66dc8a80ce9f09 # 4.25828e-008
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrd2log.S b/src/gas/vrd2log.S
new file mode 100644
index 0000000..30bb3b1
--- /dev/null
+++ b/src/gas/vrd2log.S
@@ -0,0 +1,573 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2log.s
+#
+# An implementation of the log libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_log(__m128d x);
+#
+# Computes the natural log of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. Runs 105-115 cycles for valid inputs.
+#
+#
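+# Outline of the algorithm below (a rough sketch, not a line-by-line spec):
+#   x = 2^xexp * f with f in [0.5, 1); a 7-bit index picks f1 = j/128 near f
+#   u = 2*(f - f1)/(f + f1), so that ln(f/f1) = u + u^3/12 + u^5/80 + ...
+#   ln(x) = xexp*ln(2) + ln(f1) + ln(f/f1)
+# ln(f1) and ln(2) are stored as lead/tail pairs (tables below) to keep the
+# sum accurate; inputs close to 1.0 take a separate series in (x - 1).
+#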
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+
+.equ stack_size,0x028
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd2_log
+ .type __vrd2_log,@function
+__vrd2_log:
+ sub $stack_size,%rsp
+
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+#
+# compute the index into the log tables
+#
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ xor %rax,%rax
+ subpd .L__real_one(%rip),%xmm2
+
+ movdqa %xmm0,%xmm3
+ andpd .L__real_notsign(%rip),%xmm2
+ pand .L__real_mant(%rip),%xmm3
+ movdqa %xmm3,%xmm4
+ movapd .L__real_half(%rip),%xmm5 # .5
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%r10d
+ cmp $3,%r10d
+ jz .Lall_nearone
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ xor %rcx,%rcx
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ divpd %xmm1,%xmm2 # u
+
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ xorpd %xmm1,%xmm1
+
+ cmppd $2,%xmm1,%xmm0
+ movmskpd %xmm0,%r9d
+
+# get z
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ mulpd %xmm6,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm6
+
+ addpd %xmm6,%xmm1 #r2
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__finish:
+# see if we have a near one value
+ test $3,%r10d
+ jnz .L__near_one
+.L__finishn1:
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lall_nearone:
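+# Near-one path (sketch): with r = x - 1 and u = 2*r/(2 + r),
+#   ln(x) = u + u^3*(ca_1 + u^2*(ca_2 + u^2*(ca_3 + u^2*ca_4)))
+# The code returns r + (poly - correction), where correction = r*u/2 equals
+# r - u exactly, so the leading term stays accurate when r is tiny.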
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addpd %xmm2,%xmm0
+ jmp .L__finishn1
+
+ .align 16
+.L__near_one:
+ test $1,%r10d
+ jz .L__lnn12
+
+# movapd %xmm0,%xmm6 ; save the inputs
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+# shufpd xmm0,$2,%xmm6
+
+.L__lnn12:
+ test $2,%r10d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+# shufpd xmm6,$0,%xmm0
+# movapd %xmm6,%xmm0
+
+.L__lnn1e:
+ jmp .L__finishn1
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__lninfe:
+
+ cmp $3,%r8d # both numbers?
+ jz .L__finish # return early if so
+ jmp .L__vlog1 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+# movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+# shufpd $2,%xmm1,%xmm0
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__zne:
+ jmp .L__finish
+
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz	.L__zn_x			## bits remain after shifting out the sign, so x is a nonzero negative number
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
+
diff --git a/src/gas/vrd2log10.S b/src/gas/vrd2log10.S
new file mode 100644
index 0000000..46cb2ad
--- /dev/null
+++ b/src/gas/vrd2log10.S
@@ -0,0 +1,628 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2log10.s
+#
+# An implementation of the log10 libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_log10(__m128d x);
+#
+# Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. Runs 120-130 cycles for valid inputs.
+#
+#
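+# The exponent/index reduction and the ln(1+u) evaluation below are the same
+# as in vrd2log.S; the natural-log result is then scaled by log10(e), which
+# is split into lead and tail parts so the conversion preserves the accuracy
+# of the intermediate.
+#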
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+
+.equ stack_size,0x028
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd2_log10
+ .type __vrd2_log10,@function
+__vrd2_log10:
+ sub $stack_size,%rsp
+
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+#
+# compute the index into the log10 tables
+#
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ xor %rax,%rax
+ subpd .L__real_one(%rip),%xmm2
+
+ movdqa %xmm0,%xmm3
+ andpd .L__real_notsign(%rip),%xmm2
+ pand .L__real_mant(%rip),%xmm3
+ movdqa %xmm3,%xmm4
+ movapd .L__real_half(%rip),%xmm5 # .5
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%r10d
+ cmp $3,%r10d
+ jz .Lall_nearone
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ xor %rcx,%rcx
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ divpd %xmm1,%xmm2 # u
+
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ xorpd %xmm1,%xmm1
+
+ cmppd $2,%xmm1,%xmm0
+ movmskpd %xmm0,%r9d
+
+# get z
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ mulpd %xmm6,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ movapd %xmm0,%xmm2
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm6
+
+ addpd %xmm6,%xmm1 #r2
+
+# loge to log10
+ movapd %xmm1,%xmm3
+ mulpd .L__real_log10e_tail(%rip),%xmm1
+ mulpd .L__real_log10e_tail(%rip),%xmm0
+ addpd %xmm1,%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm3
+ addpd %xmm3,%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm2
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm2,%xmm0
+
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__finish:
+# see if we have a near one value
+ test $3,%r10d
+ jnz .L__near_one
+.L__finishn1:
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lall_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log10e_tail(%rip),%xmm2
+ mulpd .L__real_log10e_tail(%rip),%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm1
+ mulpd .L__real_log10e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+
+# return r + r2;
+# addpd %xmm2,%xmm0
+ jmp .L__finishn1
+
+ .align 16
+.L__near_one:
+ test $1,%r10d
+ jz .L__lnn12
+
+# movapd %xmm0,%xmm6 ; save the inputs
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+# shufpd xmm0,$2,%xmm6
+
+.L__lnn12:
+ test $2,%r10d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+# shufpd xmm6,$0,%xmm0
+# movapd %xmm6,%xmm0
+
+.L__lnn1e:
+ jmp .L__finishn1
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log10e_tail(%rip),%xmm2
+ mulsd .L__real_log10e_tail(%rip),%xmm0
+ mulsd .L__real_log10e_lead(%rip),%xmm1
+ mulsd .L__real_log10e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__lninfe:
+
+ cmp $3,%r8d # both numbers?
+ jz .L__finish # return early if so
+ jmp .L__vlog1 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
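+# Sketch of .L__lni above in C (illustrative only):
+#   if (mantissa_bits(x) != 0) return quieted(x);   /* NaN in -> quiet NaN out */
+#   if (!signbit(x))           return x;            /* log10(+inf) = +inf      */
+#   return NaN;                                     /* -inf: invalid, NaN      */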
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+# movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+# shufpd $2,%xmm1,%xmm0
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__zne:
+ jmp .L__finish
+
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		# bits remain after shifting out the sign, so the input is a negative non-zero value

+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
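+# Sketch of .L__zni above in C (illustrative only); this path is reached only
+# when the input compared <= 0.0:
+#   if (x == 0.0) return -inf;   /* C99: log10(+-0) = -inf */
+#   return NaN;                  /* x < 0: invalid          */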
+
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03FB082C000000000 # .064495086669921875 Threshold
+ .quad 0x03FB082C000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+.L__real_log10e_lead: .quad 0x03fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01
+ .quad 0x03fdbcb7800000000
+.L__real_log10e_tail: .quad 0x03ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7
+ .quad 0x03ea8a93728719535
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+
+ .align 16
+
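+# The two tables below satisfy (to within the tail precision, roughly)
+#   lead[j] + tail[j] ~= ln(1 + j/64),   j = 0..64,
+# where lead[j] keeps only the upper mantissa bits so that multiplying it by
+# the similarly split log constants loses little precision; the last entry
+# (j = 64) is ln 2.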
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
+
diff --git a/src/gas/vrd2log2.S b/src/gas/vrd2log2.S
new file mode 100644
index 0000000..92fe290
--- /dev/null
+++ b/src/gas/vrd2log2.S
@@ -0,0 +1,621 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd2log2.s
+#
+# An implementation of the vector log2 libm function.
+#
+# Prototype:
+#
+# __m128d __vrd2_log2(__m128d x);
+#
+# Computes the log2 of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. Runs 105-115 cycles for valid inputs.
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+
+.equ stack_size,0x028
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd2_log2
+ .type __vrd2_log2,@function
+__vrd2_log2:
+ sub $stack_size,%rsp
+
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+#
+# compute the index into the log tables
+#
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ xor %rax,%rax
+ subpd .L__real_one(%rip),%xmm2
+
+ movdqa %xmm0,%xmm3
+ andpd .L__real_notsign(%rip),%xmm2
+ pand .L__real_mant(%rip),%xmm3
+ movdqa %xmm3,%xmm4
+ movapd .L__real_half(%rip),%xmm5 # .5
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%r10d
+ cmp $3,%r10d
+ jz .Lall_nearone
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ xor %rcx,%rcx
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ divpd %xmm1,%xmm2 # u
+
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ xorpd %xmm1,%xmm1
+
+ cmppd $2,%xmm1,%xmm0
+ movmskpd %xmm0,%r9d
+
+# get z
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ movapd %xmm0,%xmm5 # z1 copy
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm1
+ movapd %xmm1,%xmm2 #z2 copy
+
+ mulpd %xmm4,%xmm5 #z1*log2e_lead
+ mulpd %xmm4,%xmm1 #z2*log2e_lead
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm0 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addpd %xmm1,%xmm0 #r2
+
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm5,%xmm0 #r1+r2
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__finish:
+# see if we have a near one value
+ test $3,%r10d
+ jnz .L__near_one
+.L__finishn1:
+ add $stack_size,%rsp
+ ret
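+# Overall flow of the main (non-special-case) path above, as a C sketch
+# (names illustrative; this restates the inline comments, nothing new):
+#   /* x = 2^xexp * m, 1 <= m < 2; index (64..128) approximates 128*(m/2) */
+#   f1 = index / 128.0;   f2 = f - f1;        /* f = m/2, in [0.5, 1) */
+#   u  = f2 / (f1 + 0.5*f2);
+#   z1 = ln_lead_table[index-64];             /* ~ ln(index/64)       */
+#   z2 = ln_tail_table[index-64] + (u + u*u*u*(cb1 + u*u*(cb2 + u*u*cb3)));
+#   r1 = z1*log2e_lead + xexp;
+#   r2 = z1*log2e_tail + z2*log2e_tail + z2*log2e_lead;
+#   log2(x) ~= r1 + r2;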
+
+ .align 16
+.Lall_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+ movapd .L__real_log2e_tail(%rip),%xmm4
+# subsd %xmm6,%xmm0 ; -correction
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ movapd .L__real_log2e_lead(%rip),%xmm5
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd %xmm4,%xmm2
+ mulpd %xmm4,%xmm0
+ mulpd %xmm5,%xmm1
+ mulpd %xmm5,%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+# return r + r2;
+# addpd %xmm2,%xmm0
+ jmp .L__finishn1
+
+ .align 16
+.L__near_one:
+ test $1,%r10d
+ jz .L__lnn12
+
+# movapd %xmm0,%xmm6 ; save the inputs
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+# shufpd xmm0,$2,%xmm6
+
+.L__lnn12:
+ test $2,%r10d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+# shufpd xmm6,$0,%xmm0
+# movapd %xmm6,%xmm0
+
+.L__lnn1e:
+ jmp .L__finishn1
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+# subsd %xmm6,%xmm0 ; -correction
+ movsd .L__real_log2e_tail(%rip),%xmm4
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ movsd .L__real_log2e_lead(%rip),%xmm5
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd %xmm4,%xmm2
+ mulsd %xmm4,%xmm0
+ mulsd %xmm5,%xmm1
+ mulsd %xmm5,%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__lninfe:
+
+ cmp $3,%r8d # both numbers?
+ jz .L__finish # return early if so
+ jmp .L__vlog1 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+# movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+# shufpd $2,%xmm1,%xmm0
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+
+.L__zne:
+ jmp .L__finish
+
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		# bits remain after shifting out the sign, so the input is a negative non-zero value
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_log2e_lead: .quad 0x03FF7154400000000 # log2e_lead 1.44269180297851562500E+00
+ .quad 0x03FF7154400000000
+.L__real_log2e_tail:	.quad 0x03ECB295C17F0BBBE # log2e_tail  3.23791044778235969970E-06
+ .quad 0x03ECB295C17F0BBBE
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
+
diff --git a/src/gas/vrd2sin.S b/src/gas/vrd2sin.S
new file mode 100644
index 0000000..50c0deb
--- /dev/null
+++ b/src/gas/vrd2sin.S
@@ -0,0 +1,805 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# A vector implementation of the libm sin function.
+#
+# Prototype:
+#
+# __m128d __vrd2_sin(__m128d x);
+#
+# Computes Sine of x
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__real_ffffffffffffffff: .quad 0x0ffffffffffffffff #Sign bit one
+ .quad 0x0ffffffffffffffff
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+.equ p_temp2,0x20 # temporary for get/put bits operation
+.equ p_xmm6, 0x30 # temporary for get/put bits operation
+.equ p_xmm7, 0x40 # temporary for get/put bits operation
+.equ p_xmm8, 0x50 # temporary for get/put bits operation
+.equ p_xmm9, 0x60 # temporary for get/put bits operation
+.equ p_xmm10,0x70 # temporary for get/put bits operation
+.equ p_xmm11,0x80 # temporary for get/put bits operation
+.equ p_xmm12,0x90 # temporary for get/put bits operation
+.equ p_xmm13,0x0A0 # temporary for get/put bits operation
+.equ p_xmm14,0x0B0 # temporary for get/put bits operation
+.equ p_xmm15,0x0C0 # temporary for get/put bits operation
+.equ r, 0x0D0 # pointer to r for remainder_piby2
+.equ rr, 0x0E0 # pointer to r for remainder_piby2
+.equ region, 0x0F0 # pointer to r for remainder_piby2
+.equ p_original,0x100 # original x
+.equ	p_mask,	0x110					# mask
+.equ	p_sign,	0x120					# sign
+
+.globl __vrd2_sin
+ .type __vrd2_sin,@function
+__vrd2_sin:
+
+ sub $0x138,%rsp
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+movdqa	%xmm0,%xmm6					#keep a signed copy of x for the sign computation **
+andpd .L__real_7fffffffffffffff(%rip), %xmm0 #Unsign -
+
+movd %xmm0,%rax #rax is lower arg +
+movhpd %xmm0, p_temp+8(%rsp) # +
+mov p_temp+8(%rsp),%rcx #rcx = upper arg +
+movdqa %xmm0,%xmm1
+
+ #This will mask all nan/infs also
+pcmpgtd %xmm6,%xmm1
+movdqa %xmm1,%xmm6
+psrldq $4, %xmm1
+psrldq $8, %xmm6
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+
+movapd .L__real_3fe0000000000000(%rip), %xmm5 #0.5 for later use +
+
+por %xmm1,%xmm6
+movd %xmm6,%r11 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x +
+movapd %xmm0,%xmm4 #x +
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+
+ cmp %r10,%rax #is lower arg >= 5e5
+ jae .Llower_or_both_arg_gt_5e5
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lupper_arg_gt_5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lboth_arg_lt_than_5e5:
+# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ movapd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3=piby2_1
+ addpd %xmm5,%xmm2 # xmm2 = npi2 = x*twobypi+0.5
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1=piby2_2
+ movapd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6=piby2_2tail
+ cvttpd2dq %xmm2,%xmm0 # xmm0=convert npi2 to ints
+ cvtdq2pd %xmm0,%xmm2 # xmm2=and back to double.
+
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm3 # npi2 * piby2_1
+ subpd %xmm3,%xmm4 # xmm4 = rhead=x-npi2*piby2_1
+
+#t = rhead;
+ movapd %xmm4,%xmm5 # xmm5=t=rhead
+
+#rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm1 # xmm1= npi2*piby2_2
+
+#rhead = t - rtail;
+ subpd %xmm1,%xmm4 # xmm4= rhead = t-rtail
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subpd %xmm4,%xmm5 # t-rhead
+ subpd %xmm5,%xmm1 # rtail-(t - rhead)
+ addpd %xmm6,%xmm1 # rtail=npi2*piby2_2+(rtail-(t-rhead))
+
+#r = rhead - rtail
+#rr=(rhead-r) -rtail
+#Sign
+#Region
+ movdqa %xmm0,%xmm5 # Region +
+ movd %xmm0,%r10 # Sign
+ movdqa %xmm4,%xmm0 # rhead (handle xmm0 retype) +
+
+ subpd %xmm1,%xmm0 # rhead - rtail +
+ pand .L__reald_one_one(%rip),%xmm5 # Odd/Even region for Cos/Sin +
+ mov .L__reald_one_zero(%rip),%r9 # Compare value for cossin +
+ subpd %xmm0,%xmm4 # rr=rhead-r +
+ movd %xmm5,%r8 # Region +
+ movapd %xmm0,%xmm2 # Move for x2 +
+ mulpd %xmm0,%xmm2 # x2 +
+ subpd %xmm1,%xmm4 # rr=(rhead-r) -rtail +
+
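+# Reduction summary (C sketch; restates the inline comments above):
+#   npi2  = (int)(x*twobypi + 0.5);                /* truncation; x here is |x| */
+#   rhead = x - npi2*piby2_1;      t = rhead;
+#   rhead = t - npi2*piby2_2;
+#   rtail = npi2*piby2_2tail - ((t - rhead) - npi2*piby2_2);
+#   r  = rhead - rtail;            rr = (rhead - r) - rtail;
+#   /* npi2 & 1 selects the sin or cos series per lane; bit 1 of npi2 feeds the sign */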
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ mov %r10,%rcx
+ not %r11 #ADDED TO CHANGE THE LOGIC
+ and %r11,%r10
+ not %rcx
+ not %r11
+ and %r11,%rcx
+ or %rcx,%r10
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+
+ mov %r10,%r11
+ and %r9,%r11 #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $31,%r11 #shift upper sign bit left by 31 bits
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r11,p_sign+8(%rsp) #write out upper sign bit
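+# The bit manipulation above computes, per element,
+#   sign = signbit(x) XOR (bit 1 of npi2)
+# (the ~AB + A~B form is just XOR), since sin(|x| + n*pi/2) flips sign for
+# n mod 4 in {2,3} and sin(-x) = -sin(x); the two sign bits are positioned
+# and stored at p_sign for the final xorpd.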
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+
+.align 16
+.L__vrd2_sin_approximate:
+ cmp $0,%r8
+ jnz .Lvrd2_not_sin_piby4
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lvrd2_sin_piby4:
+ movapd .Lsinarray+0x50(%rip),%xmm3 # s6
+ movapd .Lsinarray+0x20(%rip),%xmm5 # s3
+ movapd %xmm2,%xmm1 # move for x4
+
+ mulpd %xmm2,%xmm3 # x2s6
+ mulpd %xmm2,%xmm5 # x2s3
+ mulpd %xmm2,%xmm1 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm6 # move for x3
+ addpd .Lsinarray+0x10(%rip),%xmm5 # s2+x2s3
+
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ mulpd %xmm2,%xmm1 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+ addpd .Lsinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+
+ mulpd %xmm1,%xmm3 # x6(s4 + x2(s5+x2s6))
+ mulpd %xmm0,%xmm6 # x3
+ addpd %xmm5,%xmm3 # zs
+ mulpd %xmm4,%xmm2 # 0.5 * x2 *xx
+
+ mulpd %xmm3,%xmm6 # x3*zs
+ subpd %xmm2,%xmm6 # x3*zs - 0.5 * x2 *xx
+ addpd %xmm4,%xmm6 # +xx
+ addpd %xmm6,%xmm0 # +x
+ xorpd p_sign(%rsp),%xmm0 # xor sign
+ jmp .L__vrd2_sin_cleanup
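+# C sketch of the .Lvrd2_sin_piby4 kernel above (|r| <= pi/4, rr = tail of the
+# reduced argument; restates the inline comments):
+#   x2  = r*r;
+#   zs  = s1 + x2*(s2 + x2*(s3 + x2*(s4 + x2*(s5 + x2*s6))));
+#   sin = r + (r*x2*zs - 0.5*x2*rr + rr);
+# followed by xorpd with the precomputed sign word.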
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+.align 16
+.Lvrd2_not_sin_piby4:
+ cmp $1,%r8
+ jnz .Lvrd2_not_sin_cos_piby4
+
+.Lvrd2_sin_cos_piby4:
+
+ movapd %xmm4,p_temp(%rsp) # rr move to to memory
+ movapd %xmm0,p_temp1(%rsp) # r move to to memory
+
+ movapd .Lcossinarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm5 # s3
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+
+ addpd .Lcossinarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lcossinarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lcossinarray+0x30(%rip),%xmm3 # s4 + x2(s5+x2s6)
+
+ movhlps %xmm0,%xmm0 # high of x for x3
+ mulpd %xmm4,%xmm3 # x6(s4 + x2(s5+x2s6))
+ addpd .Lcossinarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+
+ movhlps %xmm2,%xmm4 # high of x2 for x3
+ addpd %xmm5,%xmm3 # z
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+ mulsd %xmm0,%xmm4 # x3 #
+ movhlps %xmm3,%xmm5 # xmm5 = sin
+ # xmm3 = cos
+
+ mulsd %xmm4,%xmm5 # sin*x3 #
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0 #
+ mulsd %xmm1,%xmm3 # cos*x4 #
+
+ subsd %xmm2,%xmm4 # t=1.0-r #
+
+ movhlps %xmm2,%xmm6 # move 0.5 * x2 for 0.5 * x2 * xx #
+ mulsd p_temp+8(%rsp),%xmm6 # 0.5 * x2 * xx #
+ subsd %xmm6,%xmm5 # sin - 0.5 * x2 *xx #
+ addsd p_temp+8(%rsp),%xmm5 # sin+xx #
+
+ movlpd p_temp1(%rsp),%xmm6 # x
+ mulsd p_temp(%rsp),%xmm6 # x *xx #
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm1 # 1 #
+ subsd %xmm4,%xmm1 # 1 -t #
+ addsd %xmm5,%xmm0 # sin+x #
+ subsd %xmm2,%xmm1 # (1-t) - r #
+ subsd %xmm6,%xmm1 # ((1 + (-t)) - r) - x*xx #
+ addsd %xmm1,%xmm3 # cos+((1 + (-t)) - r) - x*xx #
+ addsd %xmm4,%xmm3 # cos+t #
+
+ movapd p_sign(%rsp),%xmm2 # load sign
+ movlhps %xmm0,%xmm3
+ movapd %xmm3,%xmm0
+ xorpd %xmm2,%xmm0
+ jmp .L__vrd2_sin_cleanup
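+# The mixed blocks (.Lvrd2_sin_cos_piby4 above, .Lvrd2_cos_sin_piby4 below)
+# cover the case where the two lanes land in different octant classes: one
+# element goes through the sin kernel and the other through the cos kernel,
+# and the scalar results are recombined with movlhps before the sign xor.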
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#xmm0= x, xmm2 = x2, xmm4 = xx, r8 = region, r9 = compare value for sincos path, xmm6 = Sign
+.align 16
+.Lvrd2_not_sin_cos_piby4:
+ cmp %r9,%r8
+ jnz .Lvrd2_cos_piby4
+
+.Lvrd2_cos_sin_piby4:
+
+ movapd %xmm4,p_temp(%rsp) # Store rr
+ movapd .Lsincosarray+0x50(%rip),%xmm3 # s6
+ mulpd %xmm2,%xmm3 # x2s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm5 # s3 (handle xmm5 retype)
+ movapd %xmm2,%xmm1 # move x2 for x4
+ mulpd %xmm2,%xmm1 # x4
+ mulpd %xmm2,%xmm5 # x2s3
+ addpd .Lsincosarray+0x40(%rip),%xmm3 # s5+x2s6
+ movapd %xmm2,%xmm4 # move x2 for x6
+ mulpd %xmm2,%xmm3 # x2(s5+x2s6)
+ mulpd %xmm1,%xmm4 # x6
+ addpd .Lsincosarray+0x10(%rip),%xmm5 # s2+x2s3
+ mulpd %xmm2,%xmm5 # x2(s2+x2s3)
+ addpd .Lsincosarray+0x30(%rip),%xmm3 # s4+x2(s5+x2s6)
+
+ movhlps %xmm1,%xmm1 # move high x4 for cos
+ mulpd %xmm4,%xmm3 # x6(s4+x2(s5+x2s6))
+ addpd .Lsincosarray(%rip),%xmm5 # s1+x2(s2+x2s3)
+ movapd %xmm2,%xmm4 # move low x2 for x3
+ mulsd %xmm0,%xmm4 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2
+
+ addpd %xmm3,%xmm5 # z
+ movhlps %xmm2,%xmm6 # move high r for cos
+ movhlps %xmm5,%xmm3 # xmm5 = sin
+ # xmm3 = cos
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx
+
+ mulsd %xmm4,%xmm5 # sin *x3
+ movsd .L__real_3ff0000000000000(%rip),%xmm4 # 1.0
+ mulsd %xmm1,%xmm3 # cos *x4
+ subsd %xmm6,%xmm4 # t=1.0-r
+
+ movhlps %xmm0,%xmm1
+ subsd %xmm2,%xmm5 # sin - 0.5 * x2 *xx
+
+ mulsd p_temp+8(%rsp),%xmm1 # x * xx
+ movsd .L__real_3ff0000000000000(%rip),%xmm2 # 1
+ subsd %xmm4,%xmm2 # 1 - t
+ addsd p_temp(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm1,%xmm2 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm5,%xmm0 # sin + x
+ addsd %xmm2,%xmm3 # cos+((1-t)-r - x*xx)
+ addsd %xmm4,%xmm3 # cos+t
+
+ movapd p_sign(%rsp),%xmm5 # load sign
+ movlhps %xmm3,%xmm0
+ xorpd %xmm5,%xmm0
+ jmp .L__vrd2_sin_cleanup
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+
+.Lvrd2_cos_piby4:
+ mulpd %xmm0,%xmm4 # x*xx
+ movdqa .L__real_3fe0000000000000(%rip),%xmm5 # 0.5 (handle xmm5 retype)
+ movapd .Lcosarray+0x50(%rip),%xmm1 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm0 # c3
+ mulpd %xmm2,%xmm5 # r = 0.5 *x2
+ movapd %xmm2,%xmm3 # copy of x2 for x4
+ movapd %xmm4,p_temp(%rsp) # store x*xx
+ mulpd %xmm2,%xmm1 # c6*x2
+ mulpd %xmm2,%xmm0 # c3*x2
+ subpd .L__real_3ff0000000000000(%rip),%xmm5 # -t=r-1.0
+ mulpd %xmm2,%xmm3 # x4
+ addpd .Lcosarray+0x40(%rip),%xmm1 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm0 # c2+x2C3
+ addpd .L__real_3ff0000000000000(%rip),%xmm5 # 1 + (-t)
+ mulpd %xmm2,%xmm3 # x6
+ mulpd %xmm2,%xmm1 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm0 # x2(c2+x2C3)
+ movapd %xmm2,%xmm4 # copy of x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm4 # r = 0.5 *x2
+ addpd .Lcosarray+0x30(%rip),%xmm1 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm0 # c1+x2(c2+x2C3)
+ mulpd %xmm2,%xmm2 # x4
+ subpd %xmm4,%xmm5 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm1 # x6(c4 + x2(c5+x2c6))
+ addpd %xmm1,%xmm0 # zc
+ subpd .L__real_3ff0000000000000(%rip),%xmm4 # -t=r-1.0
+ subpd p_temp(%rsp),%xmm5 # ((1 + (-t)) - r) - x*xx
+ mulpd %xmm2,%xmm0 # x4 * zc
+ addpd %xmm5,%xmm0 # x4 * zc + ((1 + (-t)) - r -x*xx)
+ subpd %xmm4,%xmm0 # result - (-t)
+ xorpd p_sign(%rsp),%xmm0 # xor with sign
+ jmp .L__vrd2_sin_cleanup
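+# C sketch of the .Lvrd2_cos_piby4 kernel above (restates the inline comments):
+#   x2 = r*r;    h = 0.5*x2;
+#   t  = 1.0 - h;              e = (1.0 - t) - h;    /* rounding error of t */
+#   zc = c1 + x2*(c2 + x2*(c3 + x2*(c4 + x2*(c5 + x2*c6))));
+#   cos = t + (x2*x2*zc + (e - r*rr));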
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Llower_or_both_arg_gt_5e5:
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm4,%xmm4
+
+# Work on Upper arg
+# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5
+# The lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+
+#If upper Arg is <=piby4
+ cmp %rdx,%rcx # is upper arg > piby4
+ ja 0f
+
+ mov $0,%ecx # region = 0
+ mov %ecx,region+4(%rsp) # store upper region
+	movlpd	 %xmm0,r+8(%rsp)			# store upper r (unsigned; the sign is applied later)
+ xorpd %xmm4,%xmm4 # rr = 0
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+ jmp .Lcheck_lower_arg
+
+#If upper Arg is > piby4
+.align 16
+0:
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1
+ cvttsd2si %xmm2,%ecx # xmm0 = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+ #/* Subtract the multiple from x to get an extra-precision remainder */
+ #rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+ #t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+ #rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+ #rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+ #rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm4,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+ #r = rhead - rtail
+ #rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm4 # rr=rhead-r
+ subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+
+#If lower Arg is > 5e5
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_lower_arg:
+ mov $0x07ff0000000000000,%r9 # is lower arg nan/inf
+ mov %r9,%r10
+ and %rax,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_lower_naninf
+
+ mov %r11,p_temp(%rsp) #Save Sign
+
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp .L__vrd2_sin_reconstruct
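+# For huge arguments (|x| >= 5e5) the 3-piece pi/2 split is no longer accurate
+# enough, so the scalar helper __amd_remainder_piby2 does the full reduction.
+# From the register setup at the call site its prototype is presumably along
+# the lines of
+#   void __amd_remainder_piby2(double x, double *r, double *rr, int *region);
+# (inferred here from the argument registers, not stated in this file).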
+
+.L__vrd2_cos_lower_naninf:
+ mov r(%rsp),%rax
+ mov $0x00008000000000000,%r9
+ or %r9,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
+ jmp .L__vrd2_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lupper_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+	movlhps	%xmm0,%xmm0				#Not strictly needed since we only work on the lower arg, but done to be safe, to avoid exceptions from nan/inf, and to mirror the lower_arg_gt_5e5 case
+ movlhps %xmm2,%xmm2
+ movlhps %xmm4,%xmm4
+
+# Work on Lower arg
+# %xmm2,,%xmm0 xmm4 = x, xmm5 = 0.5
+# The upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+#If lower Arg is <=piby4
+	cmp	%rdx,%rax					# is lower arg > piby4
+	ja	0f
+
+	mov 	$0,%eax						# region = 0
+	mov	%eax,region(%rsp)				# store lower region
+	movlpd	%xmm0,r(%rsp)					# store lower r
+	xorpd	%xmm4,%xmm4					# rr = 0
+	movlpd	%xmm4,rr(%rsp)					# store lower rr
+ jmp .Lcheck_upper_arg
+
+.align 16
+0:
+#If upper Arg is > piby4
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm5,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # xmm3 = piby2_1
+ cvttsd2si %xmm2,%eax # xmm0 = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # xmm1 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm4 # xmm4 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm4,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm1,%xmm4 # xmm4 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm4,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm4,%xmm0
+ subsd %xmm1,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm4 # rr=rhead-r
+ subsd %xmm1,%xmm4 # xmm4 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm4,rr(%rsp) # store lower rr
+
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_upper_arg:
+ mov $0x07ff0000000000000,%r9 # is upper arg nan/inf
+ mov %r9,%r10
+ and %rcx,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_upper_naninf
+
+ mov %r11,p_temp(%rsp) #Save Sign
+
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp .L__vrd2_sin_reconstruct
+
+.L__vrd2_cos_upper_naninf:
+ mov r+8(%rsp),%rcx # upper arg is nan/inf
+ mov $0x00008000000000000,%r9
+ or %r9,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+ jmp .L__vrd2_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+
+ movhpd %xmm0,p_temp2(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r9 #is lower arg nan/inf
+ mov %r9,%r10
+ and %rax,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r11,p_temp1(%rsp) #Save Sign
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp1(%rsp),%r11 #Restore Sign
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd2_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ movd %xmm0,%rax
+ mov $0x00008000000000000,%r9
+ or %r9,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r9 #is upper arg nan/inf
+ mov %r9,%r10
+ and %rcx,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_upper_naninf_of_both_gt_5e5
+
+
+ mov %r11,p_temp(%rsp) #Save Sign
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd p_temp2(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp 0f
+
+.L__vrd2_cos_upper_naninf_of_both_gt_5e5:
+ mov p_temp2(%rsp),%rcx #upper arg is nan/inf
+ mov $0x00008000000000000,%r9
+ or %r9,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+0:
+.L__vrd2_sin_reconstruct:
+#Construct xmm0=x, xmm2 =x2, xmm4=xx, r8=region, xmm6=sign
+ movapd r(%rsp),%xmm0 #x
+ movapd %xmm0,%xmm2 #move for x2
+ mulpd %xmm2,%xmm2 #x2
+ movapd rr(%rsp),%xmm4 #xx
+
+ mov region(%rsp),%r8
+ mov .L__reald_one_zero(%rip),%r9 #compare value for cossin path
+ mov %r8,%r10
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ mov %r10,%rcx
+ not %r11 #ADDED TO CHANGE THE LOGIC
+ and %r11,%r10
+ not %rcx
+ not %r11
+ and %r11,%rcx
+ or %rcx,%r10
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+
+ mov %r10,%r11
+ and %r9,%r11 #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $31,%r11 #shift upper sign bit left by 31 bits
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r11,p_sign+8(%rsp) #write out upper sign bit
+
+ jmp .L__vrd2_sin_approximate
+#ENDMAIN
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd2_sin_cleanup:
+ add $0x138,%rsp
+ ret
+
diff --git a/src/gas/vrd2sincos.S b/src/gas/vrd2sincos.S
new file mode 100644
index 0000000..b25bb37
--- /dev/null
+++ b/src/gas/vrd2sincos.S
@@ -0,0 +1,968 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# A vector implementation of the libm sincos function.
+#
+# Prototype:
+#
+# __vrd2_sincos(__m128d x, __m128d* ys, __m128d* yc);
+#
+# Computes Sine and Cosine of x.
+# It will provide proper C99 return values,
+# but may not raise floating point status bits properly.
+# Based on the NAG C implementation.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__real_ffffffffffffffff: .quad 0x0ffffffffffffffff #Sign bit one
+ .quad 0x0ffffffffffffffff
+.L__real_naninf_upper_sign_mask: .quad 0x000000000ffffffff #
+ .quad 0x000000000ffffffff #
+.L__real_naninf_lower_sign_mask: .quad 0x0ffffffff00000000 #
+ .quad 0x0ffffffff00000000 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967 # c2
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+
+.text
+.align 16
+.p2align 4,,15
+
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+.equ p_temp2, 0x20 # temporary for get/put bits operation
+
+.equ save_xmm6, 0x30 # temporary for get/put bits operation
+.equ save_xmm7, 0x40 # temporary for get/put bits operation
+.equ save_xmm8, 0x50 # temporary for get/put bits operation
+.equ save_xmm9, 0x60 # temporary for get/put bits operation
+.equ save_xmm10, 0x70 # temporary for get/put bits operation
+.equ save_xmm11, 0x80 # temporary for get/put bits operation
+.equ save_xmm12, 0x90 # temporary for get/put bits operation
+.equ save_xmm13, 0x0A0 # temporary for get/put bits operation
+.equ save_xmm14, 0x0B0 # temporary for get/put bits operation
+.equ save_xmm15, 0x0C0 # temporary for get/put bits operation
+
+.equ save_rdi, 0x0D0
+.equ save_rsi, 0x0E0
+
+.equ r, 0x0F0 # pointer to r for remainder_piby2
+.equ rr, 0x0100 # pointer to r for remainder_piby2
+.equ region, 0x0110 # pointer to r for remainder_piby2
+
+.equ p_original, 0x0120 # original x
+.equ p_mask, 0x0130 # original x
+.equ p_sign, 0x0140 # original x
+.equ p_sign1, 0x0150 # original x
+.equ p_x, 0x0160 #x
+.equ p_xx, 0x0170 #xx
+.equ p_x2, 0x0180 #x2
+.equ p_sin, 0x0190 #sin
+.equ p_cos, 0x01A0 #cos
+.equ p_temp2, 0x01B0 # temporary for get/put bits operation
+
+.globl __vrd2_sincos
+ .type __vrd2_sincos,@function
+__vrd2_sincos:
+ sub $0x1C8,%rsp
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+movdqa %xmm0,%xmm6 #move to mem to get into integer regs **
+movdqa %xmm0, p_original(%rsp) #move to mem to get into integer regs -
+
+andpd .L__real_7fffffffffffffff(%rip),%xmm0 #Unsign -
+
+mov %rdi, p_sin(%rsp) # save address for sin return
+mov %rsi, p_cos(%rsp) # save address for cos return
+
+movd %xmm0,%rax #rax is lower arg
+movhpd %xmm0, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movdqa %xmm0,%xmm8
+
+pcmpgtd %xmm6,%xmm8
+movdqa %xmm8,%xmm6
+psrldq $4,%xmm8
+psrldq $8,%xmm6
+
+mov $0x3FE921FB54442D18,%rdx #piby4
+mov $0x411E848000000000,%r10 #5e5
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+por %xmm6,%xmm8
+movd %xmm8,%r11 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x
+movapd %xmm0,%xmm6 #x
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+
+ cmp %r10,%rax #is lower arg >= 5e5
+ jae .Llower_or_both_arg_gt_5e5
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lupper_arg_gt_5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lboth_arg_lt_than_5e5:
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r8,%rcx
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+
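The comments above spell out the extra-precision reduction of x by the nearest multiple of pi/2, using the split constants piby2_1, piby2_2 and piby2_2tail from the .data section. A scalar C sketch of the same steps (the bit patterns are the ones defined above; the helper and function names are illustrative only):

    #include <stdint.h>
    #include <string.h>

    static double from_bits(uint64_t u) { double d; memcpy(&d, &u, sizeof d); return d; }

    /* Scalar model of the |x| < 5e5 reduction performed above. */
    static void reduce_piby2(double x, double *r, double *rr, int *region) {
        const double twobypi     = from_bits(0x3fe45f306dc9c883ULL);
        const double piby2_1     = from_bits(0x3ff921fb54400000ULL);
        const double piby2_2     = from_bits(0x3dd0b4611a600000ULL);
        const double piby2_2tail = from_bits(0x3ba3198a2e037073ULL);

        int    npi2  = (int)(x * twobypi + 0.5);   /* nearest multiple of pi/2  */
        double rhead = x - npi2 * piby2_1;         /* head of the remainder     */
        double rtail = npi2 * piby2_2;
        double t     = rhead;
        rhead = t - rtail;
        rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
        *r      = rhead - rtail;                   /* extra-precision remainder */
        *rr     = (rhead - *r) - rtail;            /* low-order tail of r       */
        *region = npi2 & 3;                        /* quadrant for sin/cos/sign */
    }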
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ mov %r10,%rax
+ not %r11 #ADDED TO CHANGE THE LOGIC
+ and %r11,%r10
+ not %rax
+ not %r11
+ and %r11,%rax
+ or %rax,%r10
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ mov %r10,%r11
+ and %rdx,%r11 #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $31,%r11 #shift upper sign bit left by 31 bits
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r11,p_sign+8(%rsp) #write out upper sign bit
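The bit manipulation above implements the ~AB + A~B noted in the comment, i.e. an XOR of the argument's sign bit (A) with bit 1 of the region (B), and parks the result in the top bit of each lane so the later xorpd against p_sign can flip the sin result. A per-lane sketch of the same logic (names are illustrative):

    #include <stdint.h>

    /* Sign mask for the sin term of one lane: flip the sign when exactly one of
       "x was negative" and "region bit 1" is set (~AB + A~B == A xor B). */
    static uint64_t sin_sign_mask(int x_is_negative, unsigned region) {
        unsigned flip = ((unsigned)x_is_negative ^ (region >> 1)) & 1u;
        return (uint64_t)flip << 63;          /* sign bit of an IEEE double */
    }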
+
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ movapd %xmm0,%xmm2 #move r for r2
+ mulpd %xmm0,%xmm2 #r2
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+
+ mov .L__reald_one_zero(%rip),%r9 # Compare value for cossin +
+
+
+ add .L__reald_one_one(%rip),%rcx
+ and .L__reald_two_two(%rip),%rcx
+ shr $1,%rcx
+
+ mov %rcx,%rdx
+ and %r9,%rdx #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%rcx #shift lower sign bit left by 63 bits
+ shl $31,%rdx #shift upper sign bit left by 31 bits
+ mov %rcx,p_sign1(%rsp) #write out lower sign bit
+ mov %rdx,p_sign1+8(%rsp) #write out upper sign bit
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.L__vrd2_sincos_approximate:
+ cmp $0,%r8
+ jnz .Lvrd2_not_sin_piby4
+
+.Lvrd2_sin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ mulpd %xmm2,%xmm11 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r
+ mulpd %xmm0,%xmm2 # x3 recalculate
+
+ mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ movapd %xmm0,%xmm1
+ movapd %xmm6,%xmm7
+ mulpd %xmm6,%xmm1 # x*xx
+ mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx
+
+ mulpd %xmm3,%xmm4 # x4 * zc
+ mulpd %xmm2,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ subpd %xmm7,%xmm5 # x3*zs - 0.5 * x2 *xx
+
+ addpd %xmm6,%xmm5 # sin + xx
+ subpd p_temp1(%rsp),%xmm4 # cos - (-t)
+ addpd %xmm0,%xmm5 # sin + x
+
+ jmp .L__vrd2_sincos_cleanup
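The two polynomial chains above evaluate the sin and cos cores for a reduced argument with |r| <= pi/4, keeping the extra-precision tail xx in play through the 0.5*x2*xx and x*xx corrections. A scalar sketch of what one lane computes, using the six coefficients from .Lsinarray and .Lcosarray (a model of the math, not the exact instruction order):

    /* Scalar model of the piby4 core above; s[] and c[] are the minimax
       coefficients s1..s6 and c1..c6 from .Lsinarray and .Lcosarray. */
    static void sincos_piby4(double x, double xx,
                             const double s[6], const double c[6],
                             double *sn, double *cs) {
        double x2 = x * x, x3 = x2 * x, x4 = x2 * x2, x6 = x4 * x2;

        double zs = (s[0] + x2 * (s[1] + x2 * s[2]))
                  + x6 * (s[3] + x2 * (s[4] + x2 * s[5]));
        double zc = (c[0] + x2 * (c[1] + x2 * c[2]))
                  + x6 * (c[3] + x2 * (c[4] + x2 * c[5]));

        double r = 0.5 * x2;
        double t = 1.0 - r;                                   /* head of cos */

        *sn = x + (xx + (x3 * zs - 0.5 * x2 * xx));           /* sin(x + xx) */
        *cs = t + ((((1.0 - t) - r) - x * xx) + x4 * zc);     /* cos(x + xx) */
    }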
+
+.align 16
+.Lvrd2_not_sin_piby4:
+ cmp .L__reald_one_one(%rip),%r8
+ jnz .Lvrd2_not_cos_piby4
+
+.Lvrd2_cos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ mulpd %xmm2,%xmm11 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+
+ subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r
+ mulpd %xmm0,%xmm2 # x3 recalculate
+
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6))
+
+ movapd %xmm0,%xmm1
+ movapd %xmm6,%xmm7
+ mulpd %xmm6,%xmm1 # x*xx
+ mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2
+
+ addpd %xmm9,%xmm5 # zc
+ addpd %xmm8,%xmm4 # zs
+
+ subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx
+
+ mulpd %xmm3,%xmm5 # x4 * zc
+ mulpd %xmm2,%xmm4 # x3 * zs
+
+ addpd %xmm10,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ subpd %xmm7,%xmm4 # x3*zs - 0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # sin + xx
+ subpd p_temp1(%rsp),%xmm5 # cos - (-t)
+ addpd %xmm0,%xmm4 # sin + x
+
+ jmp .L__vrd2_sincos_cleanup
+
+.align 16
+.Lvrd2_not_cos_piby4:
+ cmp $1,%r8
+ jnz .Lvrd2_cossin_piby4
+
+.Lvrd2_sincos_piby4:
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ mulpd %xmm2,%xmm11 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r
+ mulpd %xmm0,%xmm2 # x3 recalculate
+
+ mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ movapd %xmm0,%xmm1
+ movapd %xmm6,%xmm7
+ mulpd %xmm6,%xmm1 # x*xx
+ mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx
+
+ mulpd %xmm3,%xmm4 # x4 * zc
+ mulpd %xmm2,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ subpd %xmm7,%xmm5 # x3*zs - 0.5 * x2 *xx
+
+ addpd %xmm6,%xmm5 # sin + xx
+ subpd p_temp1(%rsp),%xmm4 # cos - (-t)
+ addpd %xmm0,%xmm5 # sin + x
+
+ movsd %xmm4,%xmm1
+ movsd %xmm5,%xmm4
+ movsd %xmm1,%xmm5
+
+ jmp .L__vrd2_sincos_cleanup
+
+.align 16
+.Lvrd2_cossin_piby4:
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm2,%xmm11 # x2
+
+ mulpd %xmm2,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm2,%xmm9 # c3*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm2,p_temp(%rsp) # store x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ movapd %xmm10,p_temp2(%rsp) # store r
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm2,%xmm11 # x4
+
+ mulpd %xmm2,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ movapd %xmm10,p_temp1(%rsp) # store t
+ movapd %xmm11,%xmm3 # Keep x4
+ mulpd %xmm2,%xmm9 # x2(c2+x2C3)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ mulpd %xmm2,%xmm11 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+
+ subpd p_temp2(%rsp),%xmm10 # (1 + (-t)) - r
+ mulpd %xmm0,%xmm2 # x3 recalculate
+
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm4 # x6(c4 + x2(c5+x2c6))
+
+ movapd %xmm0,%xmm1
+ movapd %xmm6,%xmm7
+ mulpd %xmm6,%xmm1 # x*xx
+ mulpd p_temp2(%rsp),%xmm7 # xx * 0.5x2
+
+ addpd %xmm9,%xmm5 # zc
+ addpd %xmm8,%xmm4 # zs
+
+ subpd %xmm1,%xmm10 # ((1 + (-t)) - r) -x*xx
+
+ mulpd %xmm3,%xmm5 # x4 * zc
+ mulpd %xmm2,%xmm4 # x3 * zs
+
+ addpd %xmm10,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ subpd %xmm7,%xmm4 # x3*zs - 0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # sin + xx
+ subpd p_temp1(%rsp),%xmm5 # cos - (-t)
+ addpd %xmm0,%xmm4 # sin + x
+
+ movsd %xmm5,%xmm1
+ movsd %xmm4,%xmm5
+ movsd %xmm1,%xmm4
+
+ jmp .L__vrd2_sincos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Llower_or_both_arg_gt_5e5:
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+
+#If upper Arg is <=piby4
+ cmp %rdx,%rcx # is upper arg > piby4
+ ja 0f
+
+ mov $0,%ecx # region = 0
+ mov %ecx,region+4(%rsp) # store upper region
+ movlpd %xmm0,r+8(%rsp) # store upper r (unsigned - sign is adjusted later based on sign)
+ xorpd %xmm4,%xmm4 # rr = 0
+ movlpd %xmm4,rr+8(%rsp) # store upper rr
+ jmp .Lcheck_lower_arg
+
+#If upper Arg is > piby4
+.align 16
+0:
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1
+ cvttsd2si %xmm2,%ecx # npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # piby2_2
+ cvtsi2sd %ecx,%xmm2 # npi2 trunc to doubles
+
+ #/* Subtract the multiple from x to get an extra-precision remainder */
+ #rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1
+ subsd %xmm3,%xmm6 # rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm8 # piby2_2tail
+
+ #t = rhead;
+ movsd %xmm6,%xmm5 # t = rhead
+
+ #rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # rtail=(npi2*piby2_2)
+
+ #rhead = t - rtail
+ subsd %xmm1,%xmm6 # rhead=(t-rtail)
+
+ #rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm8 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm8,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+ #r = rhead - rtail
+ #rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm1,%xmm0 # r=(rhead-rtail)
+
+ subsd %xmm0,%xmm6 # rr=rhead-r
+	subsd	%xmm1,%xmm6				# xmm6 = rr=((rhead-r) -rtail)
+
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#If lower Arg is > 5e5
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_lower_arg:
+ mov $0x07ff0000000000000,%r9 # is lower arg nan/inf
+ mov %r9,%r10
+ and %rax,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_lower_naninf
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ mov %r11,p_temp(%rsp) #Save Sign
+ call __amd_remainder_piby2@PLT
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp .L__vrd2_cos_reconstruct
+
+.L__vrd2_cos_lower_naninf:
+	mov	p_original(%rsp),%rax			# lower arg is nan/inf
+
+ mov $0x00008000000000000,%r9
+ or %r9,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+ and .L__real_naninf_lower_sign_mask(%rip),%r11 # Sign
+
+ jmp .L__vrd2_cos_reconstruct
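The handler above follows the usual vector-math convention for NaN/Inf inputs: OR the quiet bit into the argument, propagate it as r, and zero rr and the region. A sketch of the bit operation (0x0008000000000000 is the top mantissa bit of a double, so a signalling NaN becomes quiet and an infinity becomes a NaN):

    #include <stdint.h>
    #include <string.h>

    /* Model of the r = x | 0x0008000000000000 stores used in the naninf paths. */
    static double quiet_naninf(double x) {
        uint64_t u;
        memcpy(&u, &x, sizeof u);
        u |= 0x0008000000000000ULL;   /* set the top mantissa (quiet) bit */
        memcpy(&x, &u, sizeof x);
        return x;
    }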
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lupper_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+#If lower Arg is <=piby4
+	cmp	%rdx,%rax				# is lower arg > piby4
+	ja	0f
+
+	mov	$0,%eax					# region = 0
+	mov	%eax,region(%rsp)			# store lower region
+	movlpd	%xmm0,r(%rsp)				# store lower r
+	xorpd	%xmm4,%xmm4				# rr = 0
+	movlpd	%xmm4,rr(%rsp)				# store lower rr
+ jmp .Lcheck_upper_arg
+
+.align 16
+0:
+#If lower Arg is > piby4
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm3 # piby2_1
+ cvttsd2si %xmm2,%eax # npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm1 # piby2_2
+ cvtsi2sd %eax,%xmm2 # npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm3 # npi2 * piby2_1;
+ subsd %xmm3,%xmm6 # rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm8 # piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm1 # rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm1,%xmm6 # rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm8 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm1 # (rtail-(t-rhead))
+ addsd %xmm8,%xmm1 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm1,%xmm0 # r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm1,%xmm6 # rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm6,rr(%rsp) # store lower rr
+
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+.align 16
+.Lcheck_upper_arg:
+ mov $0x07ff0000000000000,%r9 # is upper arg nan/inf
+ mov %r9,%r10
+ and %rcx,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_upper_naninf
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ mov %r11,p_temp(%rsp) #Save Sign
+ call __amd_remainder_piby2@PLT
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp .L__vrd2_cos_reconstruct
+
+.L__vrd2_cos_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+ mov $0x00008000000000000,%r9
+ or %r9,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+ and .L__real_naninf_upper_sign_mask(%rip),%r11 # Sign
+ jmp .L__vrd2_cos_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+
+ movhpd %xmm0, p_temp2(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r9 #is lower arg nan/inf
+ mov %r9,%r10
+ and %rax,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_lower_naninf_of_both_gt_5e5
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r11,p_temp1(%rsp) #Save Sign
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r11 #Restore Sign
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd2_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r9
+ or %r9,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+ and .L__real_naninf_lower_sign_mask(%rip),%r11 # Sign
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r9 #is upper arg nan/inf
+ mov %r9,%r10
+ and %rcx,%r10
+ cmp %r9,%r10
+ jz .L__vrd2_cos_upper_naninf_of_both_gt_5e5
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd p_temp2(%rsp), %xmm0 #Restore upper fp arg for remainder_piby2 call
+ mov %r11,p_temp(%rsp) #Save Sign
+ call __amd_remainder_piby2@PLT
+ mov p_temp(%rsp),%r11 #Restore Sign
+
+ jmp 0f
+
+.L__vrd2_cos_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+ mov $0x00008000000000000,%r9
+ or %r9,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+ and .L__real_naninf_upper_sign_mask(%rip),%r11 # Sign
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+0:
+.L__vrd2_cos_reconstruct:
+#Construct p_sign=Sign for Sin term, p_sign1=Sign for Cos term, xmm0 = r, xmm2 = r2, xmm6 = rr, r8=region
+ movapd r(%rsp),%xmm0 #x
+ movapd %xmm0,%xmm2 #move for x2
+ mulpd %xmm2,%xmm2 #x2
+ movapd rr(%rsp),%xmm6 #xx
+
+ mov region(%rsp),%r8
+ mov .L__reald_one_zero(%rip),%r9 #compare value for cossin path
+ mov %r8,%r10
+ mov %r8,%rax
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ mov %r10,%rcx
+ not %r11 #ADDED TO CHANGE THE LOGIC
+ and %r11,%r10
+ not %rcx
+ not %r11
+ and %r11,%rcx
+ or %rcx,%r10
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+
+ mov %r10,%r11
+ and %r9,%r11 #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $31,%r11 #shift upper sign bit left by 31 bits
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r11,p_sign+8(%rsp) #write out upper sign bit
+
+ add .L__reald_one_one(%rip),%rax
+ and .L__reald_two_two(%rip),%rax
+ shr $1,%rax
+
+ mov %rax,%rdx
+ and %r9,%rdx #mask out the lower sign bit leaving the upper sign bit
+ shl $63,%rax #shift lower sign bit left by 63 bits
+ shl $31,%rdx #shift upper sign bit left by 31 bits
+ mov %rax,p_sign1(%rsp) #write out lower sign bit
+ mov %rdx,p_sign1+8(%rsp) #write out upper sign bit
+
+
+ jmp .L__vrd2_sincos_approximate
+
+
+#ENDMAIN
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd2_sincos_cleanup:
+
+ xorpd p_sign(%rsp),%xmm5 # SIN sign
+ xorpd p_sign1(%rsp),%xmm4 # COS sign
+
+ mov p_sin(%rsp),%rdi
+ mov p_cos(%rsp),%rsi
+
+ movapd %xmm5,(%rdi) # save the sin
+ movapd %xmm4,(%rsi) # save the cos
+
+.Lfinal_check:
+ add $0x1C8,%rsp
+ ret
+
diff --git a/src/gas/vrd4cos.S b/src/gas/vrd4cos.S
new file mode 100644
index 0000000..5ecc97c
--- /dev/null
+++ b/src/gas/vrd4cos.S
@@ -0,0 +1,2987 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# vrd4cos.s
+#
+# A vector implementation of the cos libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_cos(__m128d x1, __m128d x2);
+#
+# Computes Cosine of x for four input values at a time.
+# Results are returned in registers (see below), not stored to a y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 double precision Cosine values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# ( and indeed C) currently allows returning 2 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
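Because the two-register return described above cannot be expressed as a standard C prototype, the routine is effectively a compiler/intrinsic-level interface; its per-lane semantics are simply those of cos. A scalar reference sketch of what the four packed lanes compute (an illustration, not the library's code path):

    #include <math.h>

    /* Reference model: {x0,x1} arrive in xmm0 and {x2,x3} in xmm1; the results
       {cos(x0),cos(x1)} and {cos(x2),cos(x3)} come back in xmm0 and xmm1. */
    static void ref_vrd4_cos(const double x[4], double y[4]) {
        for (int i = 0; i < 4; i++)
            y[i] = cos(x[i]);   /* no error checking; denormal inputs may differ */
    }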
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+.align 16
+.Levencos_oddsin_tbl:
+ .quad .Lcoscos_coscos_piby4 # 0 *
+ .quad .Lcoscos_cossin_piby4 # 1 +
+ .quad .Lcoscos_sincos_piby4 # 2
+ .quad .Lcoscos_sinsin_piby4 # 3 +
+
+ .quad .Lcossin_coscos_piby4 # 4
+ .quad .Lcossin_cossin_piby4 # 5 *
+ .quad .Lcossin_sincos_piby4 # 6
+ .quad .Lcossin_sinsin_piby4 # 7
+
+ .quad .Lsincos_coscos_piby4 # 8
+ .quad .Lsincos_cossin_piby4 # 9
+ .quad .Lsincos_sincos_piby4 # 10 *
+ .quad .Lsincos_sinsin_piby4 # 11
+
+ .quad .Lsinsin_coscos_piby4 # 12
+ .quad .Lsinsin_cossin_piby4 # 13 +
+ .quad .Lsinsin_sincos_piby4 # 14
+ .quad .Lsinsin_sinsin_piby4 # 15 *
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+
+.equ p_xmm6, 0x20 # temporary for get/put bits operation
+.equ p_xmm7, 0x30 # temporary for get/put bits operation
+.equ p_xmm8, 0x40 # temporary for get/put bits operation
+.equ p_xmm9, 0x50 # temporary for get/put bits operation
+.equ p_xmm10, 0x60 # temporary for get/put bits operation
+.equ p_xmm11, 0x70 # temporary for get/put bits operation
+.equ p_xmm12, 0x80 # temporary for get/put bits operation
+.equ p_xmm13, 0x90 # temporary for get/put bits operation
+.equ p_xmm14, 0x0A0 # temporary for get/put bits operation
+.equ p_xmm15, 0x0B0 # temporary for get/put bits operation
+
+.equ r, 0x0C0 # pointer to r for remainder_piby2
+.equ rr, 0x0D0 # pointer to r for remainder_piby2
+.equ region, 0x0E0 # pointer to r for remainder_piby2
+
+.equ r1, 0x0F0 # pointer to r for remainder_piby2
+.equ rr1, 0x0100 # pointer to r for remainder_piby2
+.equ region1, 0x0110 # pointer to r for remainder_piby2
+
+.equ p_temp2, 0x0120 # temporary for get/put bits operation
+.equ p_temp3, 0x0130 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0140 # temporary for get/put bits operation
+.equ p_temp5, 0x0150 # temporary for get/put bits operation
+
+.equ p_original, 0x0160 # original x
+.equ p_mask, 0x0170 # original x
+.equ p_sign, 0x0180 # original x
+
+.equ p_original1, 0x0190 # original x
+.equ p_mask1, 0x01A0 # original x
+.equ p_sign1, 0x01B0 # original x
+
+.globl __vrd4_cos
+ .type __vrd4_cos,@function
+__vrd4_cos:
+ sub $0x1C8,%rsp
+
+#DEBUG
+# add $0x1C8,%rsp
+# ret
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+movdqa %xmm0, p_original(%rsp)
+movdqa %xmm1, p_original1(%rsp)
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm0,%rax #rax is lower arg
+movhpd %xmm0, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movd %xmm1,%r8 #rax is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #rcx = upper arg
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#DEBUG
+# add $0x1C8,%rsp
+# ret
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%rax # Region
+ movd %xmm5,%rcx # Region
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+# paddd .L__reald_one_one(%rip),%xmm4 ; Sign
+# paddd .L__reald_one_one(%rip),%xmm5 ; Sign
+# pand .L__reald_two_two(%rip),%xmm4
+# pand .L__reald_two_two(%rip),%xmm5
+# punpckldq %xmm4,%xmm4
+# punpckldq %xmm5,%xmm5
+# psllq $62,%xmm4
+# psllq $62,%xmm5
+
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+ shl $62,%r8
+ and .L__reald_two_zero(%rip),%r10
+ shl $30,%r10
+ shl $62,%r9
+ and .L__reald_two_zero(%rip),%r11
+ shl $30,%r11
+
+ mov %r8,p_sign(%rsp)
+ mov %r10,p_sign+8(%rsp)
+ mov %r9,p_sign1(%rsp)
+ mov %r11,p_sign1+8(%rsp)
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ and .L__reald_one_one(%rip),%rax # Region
+ and .L__reald_one_one(%rip),%rcx # Region
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
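The integer gymnastics above collapse each lane's region into one parity bit and pack the four bits into an index for .Levencos_oddsin_tbl, so lanes in an even region use the cos core and lanes in an odd region use the sin core. A sketch of the index computation (as traced from the shifts and ORs above):

    /* Index into .Levencos_oddsin_tbl: bit i is (region of lane i) & 1,
       selecting the cos core (0) or the sin core (1) for that lane. */
    static unsigned evencos_oddsin_index(const unsigned region[4]) {
        return  (region[0] & 1u)
              | ((region[1] & 1u) << 1)
              | ((region[2] & 1u) << 2)
              | ((region[3] & 1u) << 3);
    }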
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# rax, rcx, r8, r9
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm10, xmm12,
+# xmm9, xmm11, xmm13
+
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf:
+	mov	p_original(%rsp),%rax			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+# movd %xmm6,%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm10, xmm12,
+# xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoide exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm10	# xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	%eax,region(%rsp)			# store lower region
+	movsd	%xmm6,%xmm0
+	subsd	%xmm10,%xmm0				# xmm0 = r=(rhead-rtail)
+	subsd	%xmm0,%xmm6				# rr=rhead-r
+	subsd	%xmm10,%xmm6				# xmm6 = rr=((rhead-r) -rtail)
+	movlpd	%xmm0,r(%rsp)				# store lower r
+	movlpd	%xmm6,rr(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x; xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_cos_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13,
+# xmm5, xmm8, xmm10, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x; xmm4 = 0.5
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 # 5e5 as IEEE-754 double bits
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_higher:
+ mov p_original1(%rsp),%r8 # lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
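+
+# In C terms, the nan/inf path above reduces to the following sketch
+# (p_original1 holds the untouched input bit pattern, 0x0008000000000000 is the
+# quiet-NaN payload bit, and "as_double" stands for a bit-for-bit
+# uint64 -> double copy):
+#
+#   uint64_t bits = original_bits | 0x0008000000000000ULL;  /* quiet a signalling NaN */
+#   r      = as_double(bits);                               /* result lane = input    */
+#   rr     = 0.0;
+#   region = 0;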
+
+.align 16
+0:
+
+
+#DEBUG
+# movapd rr(%rsp),%xmm4
+# movapd rr1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ jmp .L__vrd4_cos_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movd %r8,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# mov QWORD PTR r1[rsp+8], r9
+# movapd r1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher:
+ mov p_original1+8(%rsp),%r9 #upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd r1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ jmp .L__vrd4_cos_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# %rax, %rcx, %r8, %r9
+# %xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm1,%xmm1 ;Not needed since we want to work on the lower arg, but done just to be safe, to avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+
+ movlpd %xmm1,r1(%rsp) # store lower r
+ movlpd %xmm7,rr1(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_higher:
+ mov p_original1+8(%rsp),%r9 # upper arg is nan/inf
+# mov r1+8(%rsp),%r9 # upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_cos_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_cos_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#DEBUG
+# movapd region(%rsp),%xmm4
+# movapd region1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%rax
+ mov region1(%rsp),%rcx
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+ shl $62,%r8
+ and .L__reald_two_zero(%rip),%r10
+ shl $30,%r10
+ shl $62,%r9
+ and .L__reald_two_zero(%rip),%r11
+ shl $30,%r11
+
+ mov %r8,p_sign(%rsp)
+ mov %r10,p_sign+8(%rsp)
+ mov %r9,p_sign1(%rsp)
+ mov %r11,p_sign1+8(%rsp)
+
+ and .L__reald_one_one(%rip),%rax # Region
+ and .L__reald_one_one(%rip),%rcx # Region
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+
+#DEBUG
+# movd %rax,%xmm4
+# movd %rax,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ leaq .Levencos_oddsin_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
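+
+# Per lane, the bit manipulation above implements (C sketch, with region being
+# the quadrant produced by the reduction step):
+#
+#   /* cos(x) = { +cos(r), -sin(r), -cos(r), +sin(r) } for region mod 4 = 0..3 */
+#   int flip_sign = ((region + 1) & 2) != 0;  /* quadrants 1 and 2 negate       */
+#   int use_sin   =  (region & 1);            /* odd quadrants use the sin core */
+#
+# flip_sign becomes the 0x8000000000000000 masks stored in p_sign/p_sign1, and
+# the four use_sin bits are packed into the 0..15 index used to select an entry
+# of .Levencos_oddsin_tbl.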
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_cos_cleanup:
+
+ movapd p_sign(%rsp),%xmm0
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm0 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ add $0x1C8,%rsp
+ ret
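+
+# The two xorpd instructions above apply the signs prepared in the reconstruct
+# step: each result lane is XORed with either 0 or the sign-bit mask
+# 0x8000000000000000, negating the lanes whose quadrant was 1 or 2.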
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculated, -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculated, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_cos_cleanup
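+
+# One lane of the cos evaluation above, written as a C sketch (c1..c6 are the
+# .Lcosarray coefficients; x and xx are the r/rr pair from the reduction):
+#
+#   double x2 = x * x;
+#   double zc = (c1 + x2*(c2 + x2*c3))
+#             + (x2*x2*x2)*(c4 + x2*(c5 + x2*c6));
+#   double r  = 0.5 * x2;
+#   double t  = 1.0 - r;                       /* rounded; error recovered below */
+#   double cos_x = t + ((x2*x2)*zc + (((1.0 - t) - r) - x*xx));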
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm5 = sin , xmm9 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+ addsd p_temp(%rsp),%xmm4 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm0,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
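+
+# The sin lanes of the path above follow the companion scheme (C sketch; s1..s6
+# are the sin coefficients and x, xx the r/rr pair from the reduction):
+#
+#   double x2 = x * x;
+#   double zs = (s1 + x2*(s2 + x2*s3))
+#             + (x2*x2*x2)*(s4 + x2*(s5 + x2*s6));
+#   double sin_x = x + ((x2*x)*zs - 0.5*x2*xx + xx);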
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term
+
+ movapd .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos)
+
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos)
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin)
+ mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos)
+
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep low r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin)
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos)
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin)
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+
+ addsd p_temp(%rsp),%xmm4 # sin+xx +
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx +
+
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm0,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+ movapd %xmm1,p_temp3(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movhlps %xmm2,%xmm6 # move high x2 for x3 for sin term
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term
+ mulsd p_temp2+8(%rsp),%xmm6 # get high x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term
+ # Reverse 12 and 2
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm7,%xmm9 # sin *x3
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm11,%xmm9 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_sincos_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term +
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movhlps %xmm2,%xmm6 # move high x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term +
+ mulsd p_temp2+8(%rsp),%xmm6 # get high x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term +
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # store x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm11,p_temp3(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm0,%xmm2 # x3 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm6,%xmm12 # 0.5 * x2 *xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm12,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm6,%xmm4 # x3 * zs +xx
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculated, -t = r-1
+ addpd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm3,p_temp3(%rsp) # store x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm10,p_temp2(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm1,%xmm3 # x3 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm7,%xmm13 # 0.5 * x2 *xx
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx;;;;;;;;;;;;;;;;;;;;;
+ subpd %xmm13,%xmm5 # -0.5 * x2 *xx
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm7,%xmm5 # +xx
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculated, -t = r-1
+ addpd %xmm1,%xmm5 # +x
+ subpd %xmm12,%xmm4 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+ movhlps %xmm10,%xmm10 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ movsd %xmm0,%xmm8 # lower x for sin
+ mulsd %xmm2,%xmm8 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm8,%xmm2 # lower x3 for sin
+
+ movsd %xmm6,%xmm9 # lower xx
+ # note using odd reg
+
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx for upper cos term
+ mulpd %xmm1,%xmm7 # x * xx
+ movhlps %xmm6,%xmm6
+ mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+
+ subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm8 # + t
+ addsd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zszc
+ addpd %xmm9,%xmm5 # z
+
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+ mulpd %xmm3,%xmm3 # x4
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+ mulpd %xmm1,%xmm7 # x * xx
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+ mulpd %xmm3,%xmm5
+ # x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+ addsd %xmm0,%xmm8 # +x
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+ movhlps %xmm11,%xmm11 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm1,%xmm9 # lower x for sin
+ mulsd %xmm3,%xmm9 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm9,%xmm3 # lower x3 for sin
+
+ movsd %xmm7,%xmm8 # lower xx
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for upper cos term
+ movhlps %xmm7,%xmm7
+ mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculated, -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculated, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm9 # + t
+ addsd %xmm1,%xmm5 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ movhlps %xmm11,%xmm11
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm3,%xmm12
+ mulsd %xmm1,%xmm12 # low x3 for sin
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm3,%xmm3 # high x4 for cos
+ movsd %xmm12,%xmm3 # low x3 for sin
+
+ movhlps %xmm1,%xmm8 # upper x for cos term
+ # note using even reg
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term
+
+ mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx
+
+ subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+
+ addsd %xmm1,%xmm5 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm9 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm5 # + t
+ addsd %xmm1,%xmm9 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ addsd %xmm1,%xmm9 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm5 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r = 0.5*x2
+ movapd %xmm6,p_temp(%rsp) # xx
+
+ movhlps %xmm10,%xmm10
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+
+ movsd %xmm2,%xmm13
+ mulsd %xmm0,%xmm13 # low x3 for sin
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm2,%xmm2 # high x4 for cos
+ movsd %xmm13,%xmm2 # low x3 for sin
+
+
+ movhlps %xmm0,%xmm9 # upper x for cos term ; note using even reg
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term
+ mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+ subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+ addsd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm8 # + t
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+	addpd	%xmm8,%xmm4				# zszc
+	addpd	%xmm9,%xmm5				# zs
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+	movhlps	%xmm4,%xmm8				# xmm8= sin, xmm4= cos
+
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+
+ addsd %xmm0,%xmm8 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm4 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
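+# A rough scalar sketch of the sin-sin path below (names are illustrative;
+# the code evaluates the polynomial in two halves, low-order plus x6 times
+# the high-order terms, to improve scheduling):
+#   zs  = c1 + r2*(c2 + r2*(c3 + r2*(c4 + r2*(c5 + r2*c6))));
+#   sin = r + (r*r2*zs - 0.5*r2*rr + rr);
+# where r is the reduced argument, rr its tail and r2 = r*r.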
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ movapd p_temp2(%rsp),%xmm10 # x2
+ movapd p_temp3(%rsp),%xmm11 # x2
+
+ mulpd %xmm0,%xmm10 # x3
+ mulpd %xmm1,%xmm11 # x3
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ subpd %xmm2,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # -0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # +xx
+ addpd %xmm7,%xmm5 # +xx
+
+ addpd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrd4_cos_cleanup
diff --git a/src/gas/vrd4exp.S b/src/gas/vrd4exp.S
new file mode 100644
index 0000000..a05af8b
--- /dev/null
+++ b/src/gas/vrd4exp.S
@@ -0,0 +1,502 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4exp.S
+#
+# A vector implementation of the exp libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_exp(__m128d x1, __m128d x2);
+#
+# Computes e raised to the x power for four packed double-precision values.
+# Does not perform error checking. Denormal results are truncated to 0.
+#
+# This routine computes 4 double precision exponential values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed no C standard) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
+#
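+# A rough scalar model of the steps below (names are illustrative; the
+# vector code interleaves two such computations across xmm register pairs):
+#   r  = x * (32/ln(2));            n = nearest_int(r);
+#   j  = n & 0x1f;                  m = (n - j) / 32;
+#   r  = (x - n*log2_by_32_lead) + (-n*log2_by_32_tail);   /* r1 + r2 */
+#   q  = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720;
+#   f1 = two_to_jby32_lead_table[j];  f2 = two_to_jby32_trail_table[j];
+#   exp(x) ~= 2^m * (f1 + f2 + (f1 + f2)*q);
+# with clamping of very large inputs and special handling of the infinite,
+# NaN and denormal cases further down.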
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for exponent multiply
+
+.equ save_rbx,0x020 #qword
+.equ save_rdi,0x028 #qword
+
+.equ save_rsi,0x030 #qword
+
+
+
+.equ p2_temp,0x40 # second temporary for get/put bits operation
+.equ p2_temp1,0x60 # second temporary for exponent multiply
+
+
+.equ stack_size,0x088
+
+
+# parameters are passed in by Linux as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_exp
+ .type __vrd4_exp,@function
+__vrd4_exp:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+ movapd %xmm1,%xmm6
+
+# process 4 values at a time.
+
+ movapd .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+
+# Step 1. Reduce the argument.
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# r = x * thirtytwo_by_logbaseof2;
+ movapd %xmm3,%xmm7
+ movapd %xmm0,p_temp(%rsp)
+ maxpd .L__real_C0F0000000000000(%rip),%xmm0 # protect against very large negative, non-infinite numbers
+ mulpd %xmm0,%xmm3
+
+ movapd %xmm6,p2_temp(%rsp)
+ maxpd .L__real_C0F0000000000000(%rip),%xmm6
+ mulpd %xmm6,%xmm7
+
+# save x for later.
+ minpd .L__real_40F0000000000000(%rip),%xmm3 # protect against very large, non-infinite numbers
+
+
+# /* Set n = nearest integer to r */
+ cvtpd2dq %xmm3,%xmm4
+ lea .L__two_to_jby32_lead_table(%rip),%rdi
+ lea .L__two_to_jby32_trail_table(%rip),%rsi
+ cvtdq2pd %xmm4,%xmm1
+ minpd .L__real_40F0000000000000(%rip),%xmm7 # protect against very large, non-infinite numbers
+
+ # r1 = x - n * logbaseof2_by_32_lead;
+ movapd .L__real_log2_by_32_lead(%rip),%xmm2 #
+ mulpd %xmm1,%xmm2 #
+ movq %xmm4,p_temp1(%rsp)
+ subpd %xmm2,%xmm0 # r1 in xmm0,
+
+ cvtpd2dq %xmm7,%xmm2
+ cvtdq2pd %xmm2,%xmm8
+
+# r2 = - n * logbaseof2_by_32_trail;
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm1 # r2 in xmm1
+# j = n & 0x0000001f;
+ mov $0x01f,%r9
+ mov %r9,%r8
+ mov p_temp1(%rsp),%ecx
+ and %ecx,%r9d
+ movq %xmm2,p2_temp1(%rsp)
+ movapd .L__real_log2_by_32_lead(%rip),%xmm9
+ mulpd %xmm8,%xmm9
+ subpd %xmm9,%xmm6 # r1b in xmm6
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm8 # r2b in xmm8
+
+ mov p_temp1+4(%rsp),%edx
+ and %edx,%r8d
+# f1 = two_to_jby32_lead_table[j];
+# f2 = two_to_jby32_trail_table[j];
+
+# *m = (n - j) / 32;
+ sub %r9d,%ecx
+ sar $5,%ecx #m
+ sub %r8d,%edx
+ sar $5,%edx
+
+
+ movapd %xmm0,%xmm2
+ addpd %xmm1,%xmm2 # r = r1 + r2
+
+ mov $0x01f,%r11
+ mov %r11,%r10
+ mov p2_temp1(%rsp),%ebx
+ and %ebx,%r11d
+# Step 2. Compute the polynomial.
+# q = r1 + (r2 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 ))))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+ movapd %xmm2,%xmm1
+ movapd .L__real_3f56c1728d739765(%rip),%xmm3 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm0 # 1/6
+# deal with infinite results
+ mov $1024,%rax
+ movsx %ecx,%rcx
+ cmp %rax,%rcx
+
+ mulpd %xmm2,%xmm3 # *x
+ mulpd %xmm2,%xmm0 # *x
+ mulpd %xmm2,%xmm1 # x*x
+ movapd %xmm1,%xmm4
+
+ cmovg %rax,%rcx ## if infinite, then set rcx to multiply
+ # by infinity
+ movsx %edx,%rdx
+ cmp %rax,%rdx
+
+ movapd %xmm6,%xmm9
+ addpd %xmm8,%xmm9 # rb = r1b + r2b
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm3 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm0 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm2,%xmm3 # *x
+
+	cmovg	%rax,%rdx			## if infinite, then set rdx to multiply
+ # by infinity
+# deal with denormal results
+ xor %rax,%rax
+ add $1023,%rcx # add bias
+
+ mulpd %xmm1,%xmm0 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm3 # + 1/24
+ addpd %xmm2,%xmm0 # + x
+ mulpd %xmm4,%xmm3 # *x^4
+
+# check for infinity or nan
+ movapd p_temp(%rsp),%xmm2
+
+ cmovs %rax,%rcx ## if denormal, then multiply by 0
+ shl $52,%rcx # build 2^n
+
+ sub %r11d,%ebx
+ movapd %xmm9,%xmm1
+ addpd %xmm3,%xmm0 # q = final sum
+ movapd .L__real_3f56c1728d739765(%rip),%xmm7 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm3 # 1/6
+
+# *z2 = f2 + ((f1 + f2) * q);
+ movlpd (%rsi,%r9,8),%xmm5 # f2
+ movlpd (%rsi,%r8,8),%xmm4 # f2
+ addsd (%rdi,%r8,8),%xmm4 # f1 + f2
+ addsd (%rdi,%r9,8),%xmm5 # f1 + f2
+ mov p2_temp1+4(%rsp),%r8d
+ and %r8d,%r10d
+ sar $5,%ebx #m
+ mulpd %xmm9,%xmm7 # *x
+ mulpd %xmm9,%xmm3 # *x
+ mulpd %xmm9,%xmm1 # x*x
+ sub %r10d,%r8d
+ sar $5,%r8d
+# check for infinity or nan
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ add $1023,%rdx # add bias
+ shufpd $0,%xmm4,%xmm5
+ movapd %xmm1,%xmm4
+
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ shl $52,%rdx # build 2^n
+
+ mulpd %xmm5,%xmm0
+ mov %rcx,p_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p_temp1+8(%rsp) # get 2^n to memory
+ addpd %xmm5,%xmm0 #z = z1 + z2
+ mov $1024,%rax
+ movsx %ebx,%rbx
+ cmp %rax,%rbx
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# r = scaleDouble_1(z, n);
+
+
+	cmovg	%rax,%rbx			## if infinite, then set rbx to multiply
+ # by infinity
+ movsx %r8d,%rdx
+ cmp %rax,%rdx
+
+ movmskpd %xmm2,%r8d
+
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm7 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm3 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm9,%xmm7 # *x
+	cmovg	%rax,%rdx			## if infinite, then set rdx to multiply by infinity
+
+
+ xor %rax,%rax
+ add $1023,%rbx # add bias
+
+ mulpd %xmm1,%xmm3 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm7 # + 1/24
+ addpd %xmm9,%xmm3 # + x
+ mulpd %xmm4,%xmm7 # *x^4
+
+ cmovs %rax,%rbx ## if denormal, then multiply by 0
+ shl $52,%rbx # build 2^n
+
+# Step 3. Reconstitute.
+
+ mulpd p_temp1(%rsp),%xmm0 # result *= 2^n
+ addpd %xmm7,%xmm3 # q = final sum
+
+ movlpd (%rsi,%r11,8),%xmm5 # f2
+ movlpd (%rsi,%r10,8),%xmm4 # f2
+ addsd (%rdi,%r10,8),%xmm4 # f1 + f2
+ addsd (%rdi,%r11,8),%xmm5 # f1 + f2
+
+ add $1023,%rdx # add bias
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ shufpd $0,%xmm4,%xmm5
+ shl $52,%rdx # build 2^n
+
+ mulpd %xmm5,%xmm3
+ mov %rbx,p2_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p2_temp1+8(%rsp) # get 2^n to memory
+ addpd %xmm5,%xmm3 #z = z1 + z2
+
+ movapd p2_temp(%rsp),%xmm2
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ movmskpd %xmm2,%ebx
+ test $3,%r8d
+ mulpd p2_temp1(%rsp),%xmm3 # result *= 2^n
+# we'd like to avoid a branch, and could use cmp's and and's to
+# eliminate it.  But that adds cycles to the normal cases just to
+# handle inputs that are supposed to be exceptional.  Using this
+# branch together with the check above results in faster code for
+# the normal cases.
+ jnz .L__exp_naninf
+
+.L__vda_bottom1:
+# store the result _m128d
+ test $3,%ebx
+ jnz .L__exp_naninf2
+
+.L__vda_bottom2:
+
+ movapd %xmm3,%xmm1
+
+
+#
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+ lea p_temp(%rsp),%rcx
+ call .L__naninf
+ jmp .L__vda_bottom1
+.L__exp_naninf2:
+ lea p2_temp(%rsp),%rcx
+ mov %ebx,%r8d
+ movapd %xmm0,%xmm4
+ movapd %xmm3,%xmm0
+ call .L__naninf
+ movapd %xmm0,%xmm3
+ movapd %xmm4,%xmm0
+ jmp .L__vda_bottom2
+
+# This subroutine checks a double pair for nans and infinities and
+# produces the proper result from the exceptional inputs
+# Register assumptions:
+# Inputs:
+# r8d - mask of errors
+# xmm0 - computed result vector
+# rcx - pointing to memory image of inputs
+# Outputs:
+# xmm0 - new result vector
+# %rax, %rdx, %xmm2 all modified.
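+# Roughly, for each element flagged in r8d (a sketch, names illustrative):
+#   if (mantissa(x) != 0)   result = quiet(x);   /* NaN  -> quiet NaN */
+#   else if (x == +inf)     result = +inf;       /* exp(+inf) = +inf  */
+#   else                    result = 0.0;        /* exp(-inf) = 0     */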
+.L__naninf:
+# check the first number
+ test $1,%r8d
+ jz .L__check2
+
+ mov (%rcx),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan1 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r1 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r1
+
+#NaN
+.L__enan1:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r1:
+ movd %rdx,%xmm2
+ shufpd $2,%xmm0,%xmm2
+ movsd %xmm2,%xmm0
+# check the second number
+.L__check2:
+ test $2,%r8d
+ jz .L__r3
+ mov 8(%rcx),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan2 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r2 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r2
+
+#NaN
+.L__enan2:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r2:
+ movd %rdx,%xmm2
+ shufpd $0,%xmm2,%xmm0
+.L__r3:
+ ret
+
+ .data
+ .align 64
+
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_4040000000000000: .quad 0x04040000000000000 # 32
+ .quad 0x04040000000000000
+.L__real_40F0000000000000: .quad 0x040F0000000000000 # 65536, to protect against really large numbers
+ .quad 0x040F0000000000000
+.L__real_C0F0000000000000: .quad 0x0C0F0000000000000 # -65536, to protect against really large negative numbers
+ .quad 0x0C0F0000000000000
+.L__real_3FA0000000000000: .quad 0x03FA0000000000000 # 1/32
+ .quad 0x03FA0000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_infinity: .quad 0x07ff0000000000000 #
+ .quad 0x07ff0000000000000
+.L__real_ninfinity: .quad 0x0fff0000000000000 #
+ .quad 0x0fff0000000000000
+.L__real_thirtytwo_by_log2: .quad 0x040471547652b82fe # thirtytwo_by_log2
+ .quad 0x040471547652b82fe
+.L__real_log2_by_32_lead: .quad 0x03f962e42fe000000 # log2_by_32_lead
+ .quad 0x03f962e42fe000000
+.L__real_log2_by_32_tail: .quad 0x0Bdcf473de6af278e # -log2_by_32_tail
+ .quad 0x0Bdcf473de6af278e
+.L__real_3f56c1728d739765: .quad 0x03f56c1728d739765 # 1.38889490863777199667e-03
+ .quad 0x03f56c1728d739765
+.L__real_3F811115B7AA905E: .quad 0x03F811115B7AA905E # 8.33336798434219616221e-03
+ .quad 0x03F811115B7AA905E
+.L__real_3FA5555555545D4E: .quad 0x03FA5555555545D4E # 4.16666666662260795726e-02
+ .quad 0x03FA5555555545D4E
+.L__real_3FC5555555548F7C: .quad 0x03FC5555555548F7C # 1.66666666665260878863e-01
+ .quad 0x03FC5555555548F7C
+
+
+.L__two_to_jby32_lead_table:
+ .quad 0x03ff0000000000000 # 1
+ .quad 0x03ff059b0d0000000 # 1.0219
+ .quad 0x03ff0b55860000000 # 1.04427
+ .quad 0x03ff11301d0000000 # 1.06714
+ .quad 0x03ff172b830000000 # 1.09051
+ .quad 0x03ff1d48730000000 # 1.11439
+ .quad 0x03ff2387a60000000 # 1.13879
+ .quad 0x03ff29e9df0000000 # 1.16372
+ .quad 0x03ff306fe00000000 # 1.18921
+ .quad 0x03ff371a730000000 # 1.21525
+ .quad 0x03ff3dea640000000 # 1.24186
+ .quad 0x03ff44e0860000000 # 1.26905
+ .quad 0x03ff4bfdad0000000 # 1.29684
+ .quad 0x03ff5342b50000000 # 1.32524
+ .quad 0x03ff5ab07d0000000 # 1.35426
+ .quad 0x03ff6247eb0000000 # 1.38391
+ .quad 0x03ff6a09e60000000 # 1.41421
+ .quad 0x03ff71f75e0000000 # 1.44518
+ .quad 0x03ff7a11470000000 # 1.47683
+ .quad 0x03ff8258990000000 # 1.50916
+ .quad 0x03ff8ace540000000 # 1.54221
+ .quad 0x03ff93737b0000000 # 1.57598
+ .quad 0x03ff9c49180000000 # 1.61049
+ .quad 0x03ffa5503b0000000 # 1.64576
+ .quad 0x03ffae89f90000000 # 1.68179
+ .quad 0x03ffb7f76f0000000 # 1.71862
+ .quad 0x03ffc199bd0000000 # 1.75625
+ .quad 0x03ffcb720d0000000 # 1.79471
+ .quad 0x03ffd5818d0000000 # 1.83401
+ .quad 0x03ffdfc9730000000 # 1.87417
+ .quad 0x03ffea4afa0000000 # 1.91521
+ .quad 0x03fff507650000000 # 1.95714
+ .quad 0 # for alignment
+.L__two_to_jby32_trail_table:
+ .quad 0x00000000000000000 # 0
+ .quad 0x03e48ac2ba1d73e2a # 1.1489e-008
+ .quad 0x03e69f3121ec53172 # 4.83347e-008
+ .quad 0x03df25b50a4ebbf1b # 2.67125e-010
+ .quad 0x03e68faa2f5b9bef9 # 4.65271e-008
+ .quad 0x03e368b9aa7805b80 # 5.24924e-009
+ .quad 0x03e6ceac470cd83f6 # 5.38622e-008
+ .quad 0x03e547f7b84b09745 # 1.90902e-008
+ .quad 0x03e64636e2a5bd1ab # 3.79764e-008
+ .quad 0x03e5ceaa72a9c5154 # 2.69307e-008
+ .quad 0x03e682468446b6824 # 4.49684e-008
+ .quad 0x03e18624b40c4dbd0 # 1.41933e-009
+ .quad 0x03e54d8a89c750e5e # 1.94147e-008
+ .quad 0x03e5a753e077c2a0f # 2.46409e-008
+ .quad 0x03e6a90a852b19260 # 4.94813e-008
+ .quad 0x03e0d2ac258f87d03 # 8.48872e-010
+ .quad 0x03e59fcef32422cbf # 2.42032e-008
+ .quad 0x03e61d8bee7ba46e2 # 3.3242e-008
+ .quad 0x03e4f580c36bea881 # 1.45957e-008
+ .quad 0x03e62999c25159f11 # 3.46453e-008
+ .quad 0x03e415506dadd3e2a # 8.0709e-009
+ .quad 0x03e29b8bc9e8a0388 # 2.99439e-009
+ .quad 0x03e451f8480e3e236 # 9.83622e-009
+ .quad 0x03e41f12ae45a1224 # 8.35492e-009
+ .quad 0x03e62b5a75abd0e6a # 3.48493e-008
+ .quad 0x03e47daf237553d84 # 1.11085e-008
+ .quad 0x03e6b0aa538444196 # 5.03689e-008
+ .quad 0x03e69df20d22a0798 # 4.81896e-008
+ .quad 0x03e69f7490e4bb40b # 4.83654e-008
+ .quad 0x03e4bdcdaf5cb4656 # 1.29746e-008
+ .quad 0x03e452486cc2c7b9d # 9.84533e-009
+ .quad 0x03e66dc8a80ce9f09 # 4.25828e-008
+ .quad 0 # for alignment
+
+
+
diff --git a/src/gas/vrd4frcpa.S b/src/gas/vrd4frcpa.S
new file mode 100644
index 0000000..3ae0b91
--- /dev/null
+++ b/src/gas/vrd4frcpa.S
@@ -0,0 +1,1181 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4frcpa.S
+#
+# A vector implementation of the floating point reciprocal approximation function.
+# The goal is to be faster than a divide. This routine provides four double
+# precision results from four double precision inputs. It would not be necessary
+# if SSE defined a double precision instruction similar to the single precision
+# rcpss.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_frcpa(__m128d x1, __m128d x2);
+#
+# Computes an approximate reciprocal of x.
+# A table lookup is performed on the higher 10 bits of the mantissa
+# (not including the implicit bit).
+#
+#
+#
+# This routine computes 4 double precision frcpa values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed no C standard) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops.
+#
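+# A rough scalar model of the lookup below (names are illustrative and the
+# exponent arithmetic is approximate; see the code for the exact masks):
+#   t      = bits(x) >> 41;                     /* sign, exponent, top 11 mantissa bits */
+#   index  = ((t + 1) >> 1) & 0x3ff;            /* rounded 10-bit table index           */
+#   e      = ((0x3ff000 - t) & 0x3ff800) << 1;  /* inverted (negated) exponent          */
+#   result = sign(x) | ((e | rcp_table[index]) << 40);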
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for get/put bits operation
+.equ p_x2,0x10 # temporary for get/put bits operation
+
+.equ stack_size,0x028
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_frcpa
+ .type __vrd4_frcpa,@function
+__vrd4_frcpa:
+ sub $stack_size,%rsp
+# 10 bit GPR method
+ xor %rax,%rax
+ movdqa .L__mask_expext(%rip),%xmm3
+ movdqa %xmm1,%xmm6
+ movdqa %xmm0,%xmm4
+ movdqa %xmm3,%xmm5
+## if 1/2 bit set, increment the index+exponent
+ psrlq $41,%xmm4
+ psrlq $41,%xmm6
+ movdqa %xmm4,%xmm2
+ paddq .L__int_one(%rip),%xmm4
+ psrlq $1,%xmm4
+ pand .L__mask_10bits(%rip),%xmm4
+# invert the exponent
+ psubq %xmm2,%xmm3
+ movdqa %xmm6,%xmm2
+ paddq .L__int_one(%rip),%xmm6
+ psrlq $1,%xmm6
+ pand .L__mask_10bits(%rip),%xmm6
+ psubq %xmm2,%xmm5
+ pand .L__mask_expext2(%rip),%xmm3
+ pand .L__mask_expext2(%rip),%xmm5
+ psllq $1,%xmm3
+# do the lookup and recombine
+ lea .L__rcp_table(%rip),%rdx
+
+ movdqa %xmm4,p_x(%rsp) # move the indexes to a memory location
+ psllq $1,%xmm5
+ mov p_x(%rsp),%r8 # 3 cycles faster for frcpa, but 2 cycles slower for log
+ mov p_x+8(%rsp),%r9
+ movdqa %xmm6,p_x2(%rsp) # move the indexes to a memory location
+ movd (%rdx,%r9,4),%xmm2 # lookup
+ movd (%rdx,%r8,4),%xmm4 # lookup
+ pslldq $8,%xmm2 # shift by 8 bytes
+ por %xmm4,%xmm2
+ por %xmm2,%xmm3
+ mov p_x2(%rsp),%r8 # 3 cycles faster for frcpa, but 2 cycles slower for log
+ mov p_x2+8(%rsp),%r9
+ movd (%rdx,%r9,4),%xmm2 # lookup
+ movd (%rdx,%r8,4),%xmm4 # lookup
+ pslldq $8,%xmm2 # shift by 8 bytes
+ por %xmm4,%xmm2
+ por %xmm2,%xmm5
+# shift and restore the sign
+ pand .L__mask_sign(%rip),%xmm0
+ pand .L__mask_sign(%rip),%xmm1
+ psllq $40,%xmm3
+ psllq $40,%xmm5
+ por %xmm3,%xmm0
+ por %xmm5,%xmm1
+ add $stack_size,%rsp
+ ret
+
+
+ .data
+ .align 16
+
+.L__int_one: .quad 0x00000000000000001
+ .quad 0x00000000000000001
+
+.L__mask_10bits: .quad 0x000000000000003ff
+ .quad 0x000000000000003ff
+
+.L__mask_expext: .quad 0x000000000003ff000
+ .quad 0x000000000003ff000
+
+.L__mask_expext2: .quad 0x000000000003ff800
+ .quad 0x000000000003ff800
+
+.L__mask_sign: .quad 0x08000000000000000
+ .quad 0x08000000000000000
+
+.L__real_one: .quad 0x03ff0000000000000
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000
+ .quad 0x04000000000000000
+
+ .align 16
+
+.L__rcp_table:
+ .long 0x0000
+ .long 0x0FF8
+ .long 0x0FF0
+ .long 0x0FE8
+ .long 0x0FE0
+ .long 0x0FD8
+ .long 0x0FD0
+ .long 0x0FC8
+ .long 0x0FC0
+ .long 0x0FB8
+ .long 0x0FB1
+ .long 0x0FA9
+ .long 0x0FA1
+ .long 0x0F99
+ .long 0x0F91
+ .long 0x0F89
+ .long 0x0F82
+ .long 0x0F7A
+ .long 0x0F72
+ .long 0x0F6B
+ .long 0x0F63
+ .long 0x0F5B
+ .long 0x0F53
+ .long 0x0F4C
+ .long 0x0F44
+ .long 0x0F3D
+ .long 0x0F35
+ .long 0x0F2D
+ .long 0x0F26
+ .long 0x0F1E
+ .long 0x0F17
+ .long 0x0F0F
+ .long 0x0F08
+ .long 0x0F00
+ .long 0x0EF8
+ .long 0x0EF1
+ .long 0x0EEA
+ .long 0x0EE2
+ .long 0x0EDB
+ .long 0x0ED3
+ .long 0x0ECC
+ .long 0x0EC4
+ .long 0x0EBD
+ .long 0x0EB6
+ .long 0x0EAE
+ .long 0x0EA7
+ .long 0x0EA0
+ .long 0x0E98
+ .long 0x0E91
+ .long 0x0E8A
+ .long 0x0E82
+ .long 0x0E7B
+ .long 0x0E74
+ .long 0x0E6D
+ .long 0x0E65
+ .long 0x0E5E
+ .long 0x0E57
+ .long 0x0E50
+ .long 0x0E49
+ .long 0x0E41
+ .long 0x0E3A
+ .long 0x0E33
+ .long 0x0E2C
+ .long 0x0E25
+ .long 0x0E1E
+ .long 0x0E17
+ .long 0x0E10
+ .long 0x0E09
+ .long 0x0E02
+ .long 0x0DFB
+ .long 0x0DF4
+ .long 0x0DED
+ .long 0x0DE6
+ .long 0x0DDF
+ .long 0x0DD8
+ .long 0x0DD1
+ .long 0x0DCA
+ .long 0x0DC3
+ .long 0x0DBC
+ .long 0x0DB5
+ .long 0x0DAE
+ .long 0x0DA7
+ .long 0x0DA0
+ .long 0x0D9A
+ .long 0x0D93
+ .long 0x0D8C
+ .long 0x0D85
+ .long 0x0D7E
+ .long 0x0D77
+ .long 0x0D71
+ .long 0x0D6A
+ .long 0x0D63
+ .long 0x0D5C
+ .long 0x0D56
+ .long 0x0D4F
+ .long 0x0D48
+ .long 0x0D42
+ .long 0x0D3B
+ .long 0x0D34
+ .long 0x0D2E
+ .long 0x0D27
+ .long 0x0D20
+ .long 0x0D1A
+ .long 0x0D13
+ .long 0x0D0C
+ .long 0x0D06
+ .long 0x0CFF
+ .long 0x0CF9
+ .long 0x0CF2
+ .long 0x0CEC
+ .long 0x0CE5
+ .long 0x0CDF
+ .long 0x0CD8
+ .long 0x0CD2
+ .long 0x0CCB
+ .long 0x0CC5
+ .long 0x0CBE
+ .long 0x0CB8
+ .long 0x0CB1
+ .long 0x0CAB
+ .long 0x0CA4
+ .long 0x0C9E
+ .long 0x0C98
+ .long 0x0C91
+ .long 0x0C8B
+ .long 0x0C85
+ .long 0x0C7E
+ .long 0x0C78
+ .long 0x0C72
+ .long 0x0C6B
+ .long 0x0C65
+ .long 0x0C5F
+ .long 0x0C58
+ .long 0x0C52
+ .long 0x0C4C
+ .long 0x0C46
+ .long 0x0C3F
+ .long 0x0C39
+ .long 0x0C33
+ .long 0x0C2D
+ .long 0x0C26
+ .long 0x0C20
+ .long 0x0C1A
+ .long 0x0C14
+ .long 0x0C0E
+ .long 0x0C08
+ .long 0x0C02
+ .long 0x0BFB
+ .long 0x0BF5
+ .long 0x0BEF
+ .long 0x0BE9
+ .long 0x0BE3
+ .long 0x0BDD
+ .long 0x0BD7
+ .long 0x0BD1
+ .long 0x0BCB
+ .long 0x0BC5
+ .long 0x0BBF
+ .long 0x0BB9
+ .long 0x0BB3
+ .long 0x0BAD
+ .long 0x0BA7
+ .long 0x0BA1
+ .long 0x0B9B
+ .long 0x0B95
+ .long 0x0B8F
+ .long 0x0B89
+ .long 0x0B83
+ .long 0x0B7D
+ .long 0x0B77
+ .long 0x0B71
+ .long 0x0B6C
+ .long 0x0B66
+ .long 0x0B60
+ .long 0x0B5A
+ .long 0x0B54
+ .long 0x0B4E
+ .long 0x0B48
+ .long 0x0B43
+ .long 0x0B3D
+ .long 0x0B37
+ .long 0x0B31
+ .long 0x0B2B
+ .long 0x0B26
+ .long 0x0B20
+ .long 0x0B1A
+ .long 0x0B14
+ .long 0x0B0F
+ .long 0x0B09
+ .long 0x0B03
+ .long 0x0AFE
+ .long 0x0AF8
+ .long 0x0AF2
+ .long 0x0AED
+ .long 0x0AE7
+ .long 0x0AE1
+ .long 0x0ADC
+ .long 0x0AD6
+ .long 0x0AD0
+ .long 0x0ACB
+ .long 0x0AC5
+ .long 0x0AC0
+ .long 0x0ABA
+ .long 0x0AB4
+ .long 0x0AAF
+ .long 0x0AA9
+ .long 0x0AA4
+ .long 0x0A9E
+ .long 0x0A99
+ .long 0x0A93
+ .long 0x0A8E
+ .long 0x0A88
+ .long 0x0A83
+ .long 0x0A7D
+ .long 0x0A78
+ .long 0x0A72
+ .long 0x0A6D
+ .long 0x0A67
+ .long 0x0A62
+ .long 0x0A5C
+ .long 0x0A57
+ .long 0x0A52
+ .long 0x0A4C
+ .long 0x0A47
+ .long 0x0A41
+ .long 0x0A3C
+ .long 0x0A37
+ .long 0x0A31
+ .long 0x0A2C
+ .long 0x0A27
+ .long 0x0A21
+ .long 0x0A1C
+ .long 0x0A17
+ .long 0x0A11
+ .long 0x0A0C
+ .long 0x0A07
+ .long 0x0A01
+ .long 0x09FC
+ .long 0x09F7
+ .long 0x09F2
+ .long 0x09EC
+ .long 0x09E7
+ .long 0x09E2
+ .long 0x09DD
+ .long 0x09D7
+ .long 0x09D2
+ .long 0x09CD
+ .long 0x09C8
+ .long 0x09C3
+ .long 0x09BD
+ .long 0x09B8
+ .long 0x09B3
+ .long 0x09AE
+ .long 0x09A9
+ .long 0x09A4
+ .long 0x099E
+ .long 0x0999
+ .long 0x0994
+ .long 0x098F
+ .long 0x098A
+ .long 0x0985
+ .long 0x0980
+ .long 0x097B
+ .long 0x0976
+ .long 0x0971
+ .long 0x096C
+ .long 0x0967
+ .long 0x0962
+ .long 0x095C
+ .long 0x0957
+ .long 0x0952
+ .long 0x094D
+ .long 0x0948
+ .long 0x0943
+ .long 0x093E
+ .long 0x0939
+ .long 0x0935
+ .long 0x0930
+ .long 0x092B
+ .long 0x0926
+ .long 0x0921
+ .long 0x091C
+ .long 0x0917
+ .long 0x0912
+ .long 0x090D
+ .long 0x0908
+ .long 0x0903
+ .long 0x08FE
+ .long 0x08FA
+ .long 0x08F5
+ .long 0x08F0
+ .long 0x08EB
+ .long 0x08E6
+ .long 0x08E1
+ .long 0x08DC
+ .long 0x08D8
+ .long 0x08D3
+ .long 0x08CE
+ .long 0x08C9
+ .long 0x08C4
+ .long 0x08C0
+ .long 0x08BB
+ .long 0x08B6
+ .long 0x08B1
+ .long 0x08AC
+ .long 0x08A8
+ .long 0x08A3
+ .long 0x089E
+ .long 0x089A
+ .long 0x0895
+ .long 0x0890
+ .long 0x088B
+ .long 0x0887
+ .long 0x0882
+ .long 0x087D
+ .long 0x0879
+ .long 0x0874
+ .long 0x086F
+ .long 0x086B
+ .long 0x0866
+ .long 0x0861
+ .long 0x085D
+ .long 0x0858
+ .long 0x0853
+ .long 0x084F
+ .long 0x084A
+ .long 0x0846
+ .long 0x0841
+ .long 0x083C
+ .long 0x0838
+ .long 0x0833
+ .long 0x082F
+ .long 0x082A
+ .long 0x0825
+ .long 0x0821
+ .long 0x081C
+ .long 0x0818
+ .long 0x0813
+ .long 0x080F
+ .long 0x080A
+ .long 0x0806
+ .long 0x0801
+ .long 0x07FD
+ .long 0x07F8
+ .long 0x07F4
+ .long 0x07EF
+ .long 0x07EB
+ .long 0x07E6
+ .long 0x07E2
+ .long 0x07DD
+ .long 0x07D9
+ .long 0x07D5
+ .long 0x07D0
+ .long 0x07CC
+ .long 0x07C7
+ .long 0x07C3
+ .long 0x07BE
+ .long 0x07BA
+ .long 0x07B6
+ .long 0x07B1
+ .long 0x07AD
+ .long 0x07A9
+ .long 0x07A4
+ .long 0x07A0
+ .long 0x079B
+ .long 0x0797
+ .long 0x0793
+ .long 0x078E
+ .long 0x078A
+ .long 0x0786
+ .long 0x0781
+ .long 0x077D
+ .long 0x0779
+ .long 0x0774
+ .long 0x0770
+ .long 0x076C
+ .long 0x0768
+ .long 0x0763
+ .long 0x075F
+ .long 0x075B
+ .long 0x0757
+ .long 0x0752
+ .long 0x074E
+ .long 0x074A
+ .long 0x0746
+ .long 0x0741
+ .long 0x073D
+ .long 0x0739
+ .long 0x0735
+ .long 0x0730
+ .long 0x072C
+ .long 0x0728
+ .long 0x0724
+ .long 0x0720
+ .long 0x071C
+ .long 0x0717
+ .long 0x0713
+ .long 0x070F
+ .long 0x070B
+ .long 0x0707
+ .long 0x0703
+ .long 0x06FE
+ .long 0x06FA
+ .long 0x06F6
+ .long 0x06F2
+ .long 0x06EE
+ .long 0x06EA
+ .long 0x06E6
+ .long 0x06E2
+ .long 0x06DE
+ .long 0x06DA
+ .long 0x06D5
+ .long 0x06D1
+ .long 0x06CD
+ .long 0x06C9
+ .long 0x06C5
+ .long 0x06C1
+ .long 0x06BD
+ .long 0x06B9
+ .long 0x06B5
+ .long 0x06B1
+ .long 0x06AD
+ .long 0x06A9
+ .long 0x06A5
+ .long 0x06A1
+ .long 0x069D
+ .long 0x0699
+ .long 0x0695
+ .long 0x0691
+ .long 0x068D
+ .long 0x0689
+ .long 0x0685
+ .long 0x0681
+ .long 0x067D
+ .long 0x0679
+ .long 0x0675
+ .long 0x0671
+ .long 0x066D
+ .long 0x066A
+ .long 0x0666
+ .long 0x0662
+ .long 0x065E
+ .long 0x065A
+ .long 0x0656
+ .long 0x0652
+ .long 0x064E
+ .long 0x064A
+ .long 0x0646
+ .long 0x0643
+ .long 0x063F
+ .long 0x063B
+ .long 0x0637
+ .long 0x0633
+ .long 0x062F
+ .long 0x062B
+ .long 0x0628
+ .long 0x0624
+ .long 0x0620
+ .long 0x061C
+ .long 0x0618
+ .long 0x0614
+ .long 0x0611
+ .long 0x060D
+ .long 0x0609
+ .long 0x0605
+ .long 0x0601
+ .long 0x05FE
+ .long 0x05FA
+ .long 0x05F6
+ .long 0x05F2
+ .long 0x05EF
+ .long 0x05EB
+ .long 0x05E7
+ .long 0x05E3
+ .long 0x05E0
+ .long 0x05DC
+ .long 0x05D8
+ .long 0x05D4
+ .long 0x05D1
+ .long 0x05CD
+ .long 0x05C9
+ .long 0x05C6
+ .long 0x05C2
+ .long 0x05BE
+ .long 0x05BA
+ .long 0x05B7
+ .long 0x05B3
+ .long 0x05AF
+ .long 0x05AC
+ .long 0x05A8
+ .long 0x05A4
+ .long 0x05A1
+ .long 0x059D
+ .long 0x0599
+ .long 0x0596
+ .long 0x0592
+ .long 0x058F
+ .long 0x058B
+ .long 0x0587
+ .long 0x0584
+ .long 0x0580
+ .long 0x057C
+ .long 0x0579
+ .long 0x0575
+ .long 0x0572
+ .long 0x056E
+ .long 0x056B
+ .long 0x0567
+ .long 0x0563
+ .long 0x0560
+ .long 0x055C
+ .long 0x0559
+ .long 0x0555
+ .long 0x0552
+ .long 0x054E
+ .long 0x054A
+ .long 0x0547
+ .long 0x0543
+ .long 0x0540
+ .long 0x053C
+ .long 0x0539
+ .long 0x0535
+ .long 0x0532
+ .long 0x052E
+ .long 0x052B
+ .long 0x0527
+ .long 0x0524
+ .long 0x0520
+ .long 0x051D
+ .long 0x0519
+ .long 0x0516
+ .long 0x0512
+ .long 0x050F
+ .long 0x050B
+ .long 0x0508
+ .long 0x0505
+ .long 0x0501
+ .long 0x04FE
+ .long 0x04FA
+ .long 0x04F7
+ .long 0x04F3
+ .long 0x04F0
+ .long 0x04EC
+ .long 0x04E9
+ .long 0x04E6
+ .long 0x04E2
+ .long 0x04DF
+ .long 0x04DB
+ .long 0x04D8
+ .long 0x04D5
+ .long 0x04D1
+ .long 0x04CE
+ .long 0x04CA
+ .long 0x04C7
+ .long 0x04C4
+ .long 0x04C0
+ .long 0x04BD
+ .long 0x04BA
+ .long 0x04B6
+ .long 0x04B3
+ .long 0x04B0
+ .long 0x04AC
+ .long 0x04A9
+ .long 0x04A6
+ .long 0x04A2
+ .long 0x049F
+ .long 0x049C
+ .long 0x0498
+ .long 0x0495
+ .long 0x0492
+ .long 0x048E
+ .long 0x048B
+ .long 0x0488
+ .long 0x0484
+ .long 0x0481
+ .long 0x047E
+ .long 0x047B
+ .long 0x0477
+ .long 0x0474
+ .long 0x0471
+ .long 0x046E
+ .long 0x046A
+ .long 0x0467
+ .long 0x0464
+ .long 0x0461
+ .long 0x045D
+ .long 0x045A
+ .long 0x0457
+ .long 0x0454
+ .long 0x0450
+ .long 0x044D
+ .long 0x044A
+ .long 0x0447
+ .long 0x0444
+ .long 0x0440
+ .long 0x043D
+ .long 0x043A
+ .long 0x0437
+ .long 0x0434
+ .long 0x0430
+ .long 0x042D
+ .long 0x042A
+ .long 0x0427
+ .long 0x0424
+ .long 0x0420
+ .long 0x041D
+ .long 0x041A
+ .long 0x0417
+ .long 0x0414
+ .long 0x0411
+ .long 0x040E
+ .long 0x040A
+ .long 0x0407
+ .long 0x0404
+ .long 0x0401
+ .long 0x03FE
+ .long 0x03FB
+ .long 0x03F8
+ .long 0x03F5
+ .long 0x03F1
+ .long 0x03EE
+ .long 0x03EB
+ .long 0x03E8
+ .long 0x03E5
+ .long 0x03E2
+ .long 0x03DF
+ .long 0x03DC
+ .long 0x03D9
+ .long 0x03D6
+ .long 0x03D3
+ .long 0x03CF
+ .long 0x03CC
+ .long 0x03C9
+ .long 0x03C6
+ .long 0x03C3
+ .long 0x03C0
+ .long 0x03BD
+ .long 0x03BA
+ .long 0x03B7
+ .long 0x03B4
+ .long 0x03B1
+ .long 0x03AE
+ .long 0x03AB
+ .long 0x03A8
+ .long 0x03A5
+ .long 0x03A2
+ .long 0x039F
+ .long 0x039C
+ .long 0x0399
+ .long 0x0396
+ .long 0x0393
+ .long 0x0390
+ .long 0x038D
+ .long 0x038A
+ .long 0x0387
+ .long 0x0384
+ .long 0x0381
+ .long 0x037E
+ .long 0x037B
+ .long 0x0378
+ .long 0x0375
+ .long 0x0372
+ .long 0x036F
+ .long 0x036C
+ .long 0x0369
+ .long 0x0366
+ .long 0x0363
+ .long 0x0360
+ .long 0x035E
+ .long 0x035B
+ .long 0x0358
+ .long 0x0355
+ .long 0x0352
+ .long 0x034F
+ .long 0x034C
+ .long 0x0349
+ .long 0x0346
+ .long 0x0343
+ .long 0x0340
+ .long 0x033E
+ .long 0x033B
+ .long 0x0338
+ .long 0x0335
+ .long 0x0332
+ .long 0x032F
+ .long 0x032C
+ .long 0x0329
+ .long 0x0327
+ .long 0x0324
+ .long 0x0321
+ .long 0x031E
+ .long 0x031B
+ .long 0x0318
+ .long 0x0315
+ .long 0x0313
+ .long 0x0310
+ .long 0x030D
+ .long 0x030A
+ .long 0x0307
+ .long 0x0304
+ .long 0x0302
+ .long 0x02FF
+ .long 0x02FC
+ .long 0x02F9
+ .long 0x02F6
+ .long 0x02F3
+ .long 0x02F1
+ .long 0x02EE
+ .long 0x02EB
+ .long 0x02E8
+ .long 0x02E5
+ .long 0x02E3
+ .long 0x02E0
+ .long 0x02DD
+ .long 0x02DA
+ .long 0x02D8
+ .long 0x02D5
+ .long 0x02D2
+ .long 0x02CF
+ .long 0x02CC
+ .long 0x02CA
+ .long 0x02C7
+ .long 0x02C4
+ .long 0x02C1
+ .long 0x02BF
+ .long 0x02BC
+ .long 0x02B9
+ .long 0x02B7
+ .long 0x02B4
+ .long 0x02B1
+ .long 0x02AE
+ .long 0x02AC
+ .long 0x02A9
+ .long 0x02A6
+ .long 0x02A3
+ .long 0x02A1
+ .long 0x029E
+ .long 0x029B
+ .long 0x0299
+ .long 0x0296
+ .long 0x0293
+ .long 0x0291
+ .long 0x028E
+ .long 0x028B
+ .long 0x0288
+ .long 0x0286
+ .long 0x0283
+ .long 0x0280
+ .long 0x027E
+ .long 0x027B
+ .long 0x0278
+ .long 0x0276
+ .long 0x0273
+ .long 0x0270
+ .long 0x026E
+ .long 0x026B
+ .long 0x0268
+ .long 0x0266
+ .long 0x0263
+ .long 0x0261
+ .long 0x025E
+ .long 0x025B
+ .long 0x0259
+ .long 0x0256
+ .long 0x0253
+ .long 0x0251
+ .long 0x024E
+ .long 0x024C
+ .long 0x0249
+ .long 0x0246
+ .long 0x0244
+ .long 0x0241
+ .long 0x023E
+ .long 0x023C
+ .long 0x0239
+ .long 0x0237
+ .long 0x0234
+ .long 0x0232
+ .long 0x022F
+ .long 0x022C
+ .long 0x022A
+ .long 0x0227
+ .long 0x0225
+ .long 0x0222
+ .long 0x021F
+ .long 0x021D
+ .long 0x021A
+ .long 0x0218
+ .long 0x0215
+ .long 0x0213
+ .long 0x0210
+ .long 0x020E
+ .long 0x020B
+ .long 0x0208
+ .long 0x0206
+ .long 0x0203
+ .long 0x0201
+ .long 0x01FE
+ .long 0x01FC
+ .long 0x01F9
+ .long 0x01F7
+ .long 0x01F4
+ .long 0x01F2
+ .long 0x01EF
+ .long 0x01ED
+ .long 0x01EA
+ .long 0x01E8
+ .long 0x01E5
+ .long 0x01E3
+ .long 0x01E0
+ .long 0x01DE
+ .long 0x01DB
+ .long 0x01D9
+ .long 0x01D6
+ .long 0x01D4
+ .long 0x01D1
+ .long 0x01CF
+ .long 0x01CC
+ .long 0x01CA
+ .long 0x01C7
+ .long 0x01C5
+ .long 0x01C2
+ .long 0x01C0
+ .long 0x01BD
+ .long 0x01BB
+ .long 0x01B9
+ .long 0x01B6
+ .long 0x01B4
+ .long 0x01B1
+ .long 0x01AF
+ .long 0x01AC
+ .long 0x01AA
+ .long 0x01A7
+ .long 0x01A5
+ .long 0x01A3
+ .long 0x01A0
+ .long 0x019E
+ .long 0x019B
+ .long 0x0199
+ .long 0x0196
+ .long 0x0194
+ .long 0x0192
+ .long 0x018F
+ .long 0x018D
+ .long 0x018A
+ .long 0x0188
+ .long 0x0186
+ .long 0x0183
+ .long 0x0181
+ .long 0x017E
+ .long 0x017C
+ .long 0x017A
+ .long 0x0177
+ .long 0x0175
+ .long 0x0173
+ .long 0x0170
+ .long 0x016E
+ .long 0x016B
+ .long 0x0169
+ .long 0x0167
+ .long 0x0164
+ .long 0x0162
+ .long 0x0160
+ .long 0x015D
+ .long 0x015B
+ .long 0x0159
+ .long 0x0156
+ .long 0x0154
+ .long 0x0151
+ .long 0x014F
+ .long 0x014D
+ .long 0x014A
+ .long 0x0148
+ .long 0x0146
+ .long 0x0143
+ .long 0x0141
+ .long 0x013F
+ .long 0x013C
+ .long 0x013A
+ .long 0x0138
+ .long 0x0136
+ .long 0x0133
+ .long 0x0131
+ .long 0x012F
+ .long 0x012C
+ .long 0x012A
+ .long 0x0128
+ .long 0x0125
+ .long 0x0123
+ .long 0x0121
+ .long 0x011F
+ .long 0x011C
+ .long 0x011A
+ .long 0x0118
+ .long 0x0115
+ .long 0x0113
+ .long 0x0111
+ .long 0x010F
+ .long 0x010C
+ .long 0x010A
+ .long 0x0108
+ .long 0x0105
+ .long 0x0103
+ .long 0x0101
+ .long 0x00FF
+ .long 0x00FC
+ .long 0x00FA
+ .long 0x00F8
+ .long 0x00F6
+ .long 0x00F3
+ .long 0x00F1
+ .long 0x00EF
+ .long 0x00ED
+ .long 0x00EA
+ .long 0x00E8
+ .long 0x00E6
+ .long 0x00E4
+ .long 0x00E2
+ .long 0x00DF
+ .long 0x00DD
+ .long 0x00DB
+ .long 0x00D9
+ .long 0x00D6
+ .long 0x00D4
+ .long 0x00D2
+ .long 0x00D0
+ .long 0x00CE
+ .long 0x00CB
+ .long 0x00C9
+ .long 0x00C7
+ .long 0x00C5
+ .long 0x00C3
+ .long 0x00C0
+ .long 0x00BE
+ .long 0x00BC
+ .long 0x00BA
+ .long 0x00B8
+ .long 0x00B5
+ .long 0x00B3
+ .long 0x00B1
+ .long 0x00AF
+ .long 0x00AD
+ .long 0x00AB
+ .long 0x00A8
+ .long 0x00A6
+ .long 0x00A4
+ .long 0x00A2
+ .long 0x00A0
+ .long 0x009E
+ .long 0x009B
+ .long 0x0099
+ .long 0x0097
+ .long 0x0095
+ .long 0x0093
+ .long 0x0091
+ .long 0x008F
+ .long 0x008C
+ .long 0x008A
+ .long 0x0088
+ .long 0x0086
+ .long 0x0084
+ .long 0x0082
+ .long 0x0080
+ .long 0x007D
+ .long 0x007B
+ .long 0x0079
+ .long 0x0077
+ .long 0x0075
+ .long 0x0073
+ .long 0x0071
+ .long 0x006F
+ .long 0x006D
+ .long 0x006A
+ .long 0x0068
+ .long 0x0066
+ .long 0x0064
+ .long 0x0062
+ .long 0x0060
+ .long 0x005E
+ .long 0x005C
+ .long 0x005A
+ .long 0x0058
+ .long 0x0056
+ .long 0x0053
+ .long 0x0051
+ .long 0x004F
+ .long 0x004D
+ .long 0x004B
+ .long 0x0049
+ .long 0x0047
+ .long 0x0045
+ .long 0x0043
+ .long 0x0041
+ .long 0x003F
+ .long 0x003D
+ .long 0x003B
+ .long 0x0039
+ .long 0x0036
+ .long 0x0034
+ .long 0x0032
+ .long 0x0030
+ .long 0x002E
+ .long 0x002C
+ .long 0x002A
+ .long 0x0028
+ .long 0x0026
+ .long 0x0024
+ .long 0x0022
+ .long 0x0020
+ .long 0x001E
+ .long 0x001C
+ .long 0x001A
+ .long 0x0018
+ .long 0x0016
+ .long 0x0014
+ .long 0x0012
+ .long 0x0010
+ .long 0x000E
+ .long 0x000C
+ .long 0x000A
+ .long 0x0008
+ .long 0x0006
+ .long 0x0004
+ .long 0x0002
+
diff --git a/src/gas/vrd4log.S b/src/gas/vrd4log.S
new file mode 100644
index 0000000..1e2b1e4
--- /dev/null
+++ b/src/gas/vrd4log.S
@@ -0,0 +1,855 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4log.S
+#
+# A vector implementation of the log libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_log(__m128d x1, __m128d x2);
+#
+# Computes the natural log of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute 4 logs in
+# 192 cycles, or 48 cycles per value.
+#
+# This routine computes 4 double precision log values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed no C standard) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
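+# A rough scalar model of the main path below (names are illustrative;
+# near-one, zero, negative, NaN and infinite inputs take separate paths):
+#   xexp  = biased_exponent(x) - 1023;   /* f = mantissa of x scaled into [0.5,1) */
+#   index = top 7 mantissa bits of f, rounded, in [64,128];
+#   f1    = index/128;   f2 = f - f1;
+#   u     = f2 / (f1 + 0.5*f2);
+#   poly  = u + u^3*(cb_1 + u^2*(cb_2 + u^2*cb_3));
+#   log(x) ~= xexp*log2_lead + ln_lead_table[index-64]
+#           + (poly + ln_tail_table[index-64] + xexp*log2_tail);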
+
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_log
+ .type __vrd4_log,@function
+__vrd4_log:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+
+# process 4 values at a time.
+
+ movdqa %xmm1,p_x2(%rsp) # save the input values
+ movdqa %xmm0,p_x(%rsp) # save the input values
+# compute the logs
+
+## if NaN or inf
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ movapd p_xexp(%rsp),%xmm5 # xexp
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ mulpd %xmm5,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+
+ addpd %xmm5,%xmm1 #r2
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+
+.L__vlog2:
+
+
+	# It seems like a good idea to try to interleave
+	# even more of the following code earlier in the
+	# routine, but there were conflicts with the table
+	# index registers that made this difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+
+ movapd p_xexp2(%rsp),%xmm5 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ mulpd %xmm5,%xmm4
+ addpd %xmm4,%xmm7 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ addpd %xmm5,%xmm9 #r2
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm9,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+
+
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+# store the result _m128d
+ movapd %xmm7,%xmm1
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd %xmm0,%xmm8
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ movapd %xmm0,%xmm8
+ test $1,%r9d
+ jz .L__lnn22
+
+ movapd %xmm7,%xmm0
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movapd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd %xmm7,p_x2(%rsp)
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x2+8(%rsp)
+ movapd p_x2(%rsp),%xmm7
+
+.L__lnn2e:
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ movapd %xmm0,%xmm2
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
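+# Roughly (a sketch, names illustrative):
+#   if (mantissa(x) != 0)   return quiet(x);   /* NaN -> quiet NaN   */
+#   else if (x == +inf)     return +inf;       /* log(+inf) = +inf   */
+#   else                    return NaN;        /* x == -inf: invalid */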
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ movapd %xmm0,%xmm2
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		## nonzero after the shift means a negative, non-zero input
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_two:		.quad 0x04000000000000000	# 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrd4log10.S b/src/gas/vrd4log10.S
new file mode 100644
index 0000000..d0f861c
--- /dev/null
+++ b/src/gas/vrd4log10.S
@@ -0,0 +1,924 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4log10.asm
+#
+# A vector implementation of the log10 libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_log10(__m128d x1, __m128d x2);
+#
+#   Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute 4 log10s in
+# 220 cycles, or 55 per value
+#
+# This routine computes 4 double precision log10 values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+#   (and indeed C itself) currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
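+# In other words (an illustrative restatement, not part of the original
+# source): on entry xmm0 = {a, b} and xmm1 = {c, d}; on return
+# xmm0 = {log10(a), log10(b)} and xmm1 = {log10(c), log10(d)}, element by
+# element.  Because standard C cannot express the two-register return, the
+# entry point is intended to be used directly by compilers or from
+# assembly rather than through an ordinary C prototype.
+#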
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_log10
+ .type __vrd4_log10,@function
+__vrd4_log10:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+
+# process 4 values at a time.
+
+ movdqa %xmm1,p_x2(%rsp) # save the input values
+ movdqa %xmm0,p_x(%rsp) # save the input values
+# compute the log10s
+
+## if NaN or inf
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
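+
+# (Illustrative sketch, not part of the original source: the exponent and
+#  index extraction above, together with the f1/f2/u reduction that
+#  follows, correspond roughly to the scalar C below; every name here is
+#  hypothetical.)
+#
+#     #include <stdint.h>
+#     #include <string.h>
+#     uint64_t ux;  memcpy(&ux, &x, 8);               /* raw bits of x      */
+#     int      xexp = (int)(ux >> 52) - 1023;         /* unbiased exponent  */
+#     uint64_t mant = ux & 0x000FFFFFFFFFFFFFULL;     /* 52 mantissa bits   */
+#     int      top7 = (int)(mant >> 45);              /* leading 7 bits     */
+#     int      index = (top7 >> 1) + (top7 & 1) + 64; /* rounded, biased    */
+#     double   f1 = index * 0.0078125;                /* index/128, [.5,1]  */
+#     uint64_t fb = mant | 0x3FE0000000000000ULL;     /* exponent of 0.5    */
+#     double   f;   memcpy(&f, &fb, 8);               /* f in [0.5, 1)      */
+#     double   f2 = f - f1;
+#     double   u  = f2 / (f1 + 0.5 * f2);             /* reduced argument   */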
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log10 tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ movapd p_xexp(%rsp),%xmm5 # xexp
+ addpd %xmm2,%xmm1 # poly
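+# (Sketch of the recombination below, not part of the original source:
+#  ln(x) is assembled as xexp*ln2 + T[index] + poly(u), where T[] is the
+#  tabulated log of the leading fraction, split across the lead/tail
+#  tables; r1 collects the lead contributions and r2 the tail
+#  contributions plus the polynomial.  log10(x) is then (r1 + r2)*log10e,
+#  with log10e split into .L__real_log10e_lead/_tail and the partial
+#  products accumulated smallest first to limit rounding error.)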
+# recombine
+ mulpd %xmm5,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ movapd %xmm0,%xmm2 #for log10
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ mulpd .L__real_log10e_tail(%rip),%xmm0 #for log10
+ mulpd .L__real_log10e_lead(%rip),%xmm2 #for log10
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm4,%xmm1
+
+
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ movapd .L__real_half(%rip),%xmm4 # .5
+
+
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+
+ addpd %xmm5,%xmm1 #r2
+ movapd %xmm1,%xmm7 #for log10
+ mulpd .L__real_log10e_tail(%rip),%xmm1 #for log10
+ addpd %xmm1,%xmm0 #for log10
+
+
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ mulpd .L__real_log10e_lead(%rip),%xmm7 #log10
+ andpd .L__real_inf(%rip),%xmm3
+
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ addpd %xmm7,%xmm0 #for log10
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+
+
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm2,%xmm0 #for log10
+# addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+
+.L__vlog2:
+
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+
+ movapd p_xexp2(%rsp),%xmm5 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ mulpd %xmm5,%xmm4
+ addpd %xmm4,%xmm7 #r1
+ movapd %xmm7,%xmm6 #for log10
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mulpd .L__real_log10e_tail(%rip),%xmm7 #for log10
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ mulpd .L__real_log10e_lead(%rip),%xmm6 #for log10
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ addpd %xmm5,%xmm9 #r2
+ movapd %xmm9,%xmm8 #for log10
+	mulpd	.L__real_log10e_tail(%rip),%xmm9	#for log10
+ addpd %xmm9,%xmm7 #for log10
+ mulpd .L__real_log10e_lead(%rip),%xmm8 #for log10
+ addpd %xmm8,%xmm7 #for log10
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm6,%xmm7 #for log10
+# addpd %xmm9,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+
+
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+# store the result __m128d
+ movapd %xmm7,%xmm1
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log10
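+# (Illustrative note, not part of the original source: r is masked with
+#  .L__mask_lower so that r1 keeps only the upper 32 bits of its
+#  representation; the product r1*log10e_lead is then exact, and the
+#  discarded low part (r - r1) is folded into r2 before the lead/tail
+#  products are summed.)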
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log10e_tail(%rip),%xmm2
+ mulpd .L__real_log10e_tail(%rip),%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm1
+ mulpd .L__real_log10e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+# return r + r2;
+# addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd %xmm0,%xmm8
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ movapd %xmm0,%xmm8
+ test $1,%r9d
+ jz .L__lnn22
+
+ movapd %xmm7,%xmm0
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movapd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd %xmm7,p_x2(%rsp)
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x2+8(%rsp)
+ movapd p_x2(%rsp),%xmm7
+
+.L__lnn2e:
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log10e_tail(%rip),%xmm2
+ mulsd .L__real_log10e_tail(%rip),%xmm0
+ mulsd .L__real_log10e_lead(%rip),%xmm1
+ mulsd .L__real_log10e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ movapd %xmm0,%xmm2
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
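+# (Illustrative sketch, not part of the original source; in rough scalar
+#  terms, with `bits` the raw input and all names hypothetical:
+#      if (bits & 0x000FFFFFFFFFFFFFULL) return quieted input NaN;
+#      else if ((int64_t)bits >= 0)      return +inf;  /* log10(+inf)   */
+#      else                              return NaN;   /* log10(-inf)   */
+#  )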
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ movapd %xmm0,%xmm2
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
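+# (Illustrative sketch, not part of the original source; in rough scalar
+#  terms, with `bits` the raw input:
+#      if ((bits << 1) == 0) return -inf;  /* C99: log10(+-0) = -inf    */
+#      else                  return NaN;   /* negative, non-zero input  */
+#  )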
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x				## non-zero after shifting out the sign bit => negative input, return NaN
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_two: .quad 0x04000000000000000	# 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03FB082C000000000 # .064495086669921875 Threshold
+ .quad 0x03FB082C000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+.L__real_log10e_lead: .quad 0x03fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01
+ .quad 0x03fdbcb7800000000
+.L__real_log10e_tail: .quad 0x03ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7
+ .quad 0x03ea8a93728719535
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrd4log2.S b/src/gas/vrd4log2.S
new file mode 100644
index 0000000..bc254cf
--- /dev/null
+++ b/src/gas/vrd4log2.S
@@ -0,0 +1,908 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrd4log2.asm
+#
+# A vector implementation of the log2 libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_log2(__m128d x1, __m128d x2);
+#
+#   Computes the base-2 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+#   Less than 1 ulp of error. This version can compute 4 log2s in
+# 192 cycles, or 48 per value
+#
+# This routine computes 4 double precision log values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+#   (and indeed C itself) currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# This routine is derived directly from the array version.
+#
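+# Illustrative use (hypothetical, not part of this file): a vectorizing
+# compiler aware of this entry point could strip-mine a loop such as
+#
+#     for (i = 0; i < n; i++) y[i] = log2(x[i]);
+#
+# into code that loads x[i..i+3] into xmm0/xmm1, calls __vrd4_log2, and
+# stores the two result registers back to y[i..i+3], avoiding the extra
+# trip through memory that the array variant requires.
+#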
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are expected as:
+# xmm0 - __m128d x1
+# xmm1 - __m128d x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrd4_log2
+ .type __vrd4_log2,@function
+__vrd4_log2:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+
+# process 4 values at a time.
+
+ movdqa %xmm1,p_x2(%rsp) # save the input values
+ movdqa %xmm0,p_x(%rsp) # save the input values
+# compute the logs
+
+## if NaN or inf
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ movapd %xmm0,%xmm5 #z1 copy
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+
+ movapd p_xexp(%rsp),%xmm6 # xexp
+ addpd %xmm2,%xmm1 # poly
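+# (Sketch of the recombination below, not part of the original source:
+#  the log of the reduced fraction is z1 + z2, where z1 is the lead-table
+#  entry and z2 is the tail-table entry plus the polynomial; then
+#      log2(x) = xexp + (z1 + z2)*log2e,
+#  with log2e split into .L__real_log2e_lead/_tail: r1 = xexp +
+#  z1*log2e_lead, and r2 collects the remaining tail products, which are
+#  added last.)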
+# recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm2,%xmm1 #z2
+ movapd %xmm1,%xmm2 #z2 copy
+
+
+ mulpd %xmm4,%xmm5
+ mulpd %xmm4,%xmm1
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm0 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addpd %xmm1,%xmm0 #r2
+
+
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm5,%xmm0 #r1+r2
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+
+.L__vlog2:
+
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ movapd %xmm7,%xmm5 #z1 copy
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+ movapd p_xexp2(%rsp),%xmm6 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9 #z2
+ movapd %xmm9,%xmm2 #z2 copy
+
+ mulpd %xmm4,%xmm5 #z1*log2e_lead
+ mulpd %xmm4,%xmm9 #z2*log2e_lead
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm7 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm7 #z1*log2e_tail + z2*log2e_tail
+
+
+ addpd %xmm9,%xmm7 #r2
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm5,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+
+
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+# store the result __m128d
+ movapd %xmm7,%xmm1
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log2e_tail(%rip),%xmm2
+ mulpd .L__real_log2e_tail(%rip),%xmm0
+ mulpd .L__real_log2e_lead(%rip),%xmm1
+ mulpd .L__real_log2e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+
+# return r + r2;
+# addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd %xmm0,p_x(%rsp)
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd p_x(%rsp),%xmm0
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd %xmm0,%xmm8
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ movapd %xmm0,%xmm8
+ test $1,%r9d
+ jz .L__lnn22
+
+ movapd %xmm7,%xmm0
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movapd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd %xmm7,p_x2(%rsp)
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,p_x2+8(%rsp)
+ movapd p_x2(%rsp),%xmm7
+
+.L__lnn2e:
+ movapd %xmm8,%xmm0
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+
+# loge to log2
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log2e_tail(%rip),%xmm2
+ mulsd .L__real_log2e_tail(%rip),%xmm0
+ mulsd .L__real_log2e_lead(%rip),%xmm1
+ mulsd .L__real_log2e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ movapd %xmm0,%xmm2
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ movapd %xmm0,%xmm2
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ movapd %xmm2,%xmm0
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x				## non-zero after shifting out the sign bit => negative input, return NaN
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+
+ .data
+ .align 16
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_two: .quad 0x04000000000000000	# 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_log2e_lead: .quad 0x03FF7154400000000 # log2e_lead 1.44269180297851562500E+00
+ .quad 0x03FF7154400000000
+.L__real_log2e_tail: .quad 0x03ECB295C17F0BBBE # log2e_tail 3.23791044778235969970E-06
+ .quad 0x03ECB295C17F0BBBE
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+ .align 16
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrd4sin.S b/src/gas/vrd4sin.S
new file mode 100644
index 0000000..b611dfd
--- /dev/null
+++ b/src/gas/vrd4sin.S
@@ -0,0 +1,2915 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+# vrd4sin.s
+#
+# A vector implementation of the sin libm function.
+#
+# Prototype:
+#
+# __m128d,__m128d __vrd4_sin(__m128d x1, __m128d x2);
+#
+# Computes the sine of x.
+# Unlike the array version it is derived from, results are returned in registers, not stored to a y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 double precision Sine values at a time.
+# The four values are passed as packed doubles in xmm0 and xmm1.
+# The four results are returned as packed doubles in xmm0 and xmm1.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C itself) currently allows returning 2 values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates that overhead
+# when the data does not already reside in memory.
+# This routine is derived directly from the array version.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
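+# As a point of reference, the semantics (ignoring the special cases noted
+# above) are simply four independent double-precision sine evaluations. A
+# minimal C sketch of what the routine computes; the packed two-register
+# return interface itself cannot be written in portable C:
+#
+#   #include <math.h>
+#   /* reference behaviour of __vrd4_sin on four packed doubles */
+#   static void vrd4_sin_ref(const double x[4], double y[4]) {
+#       for (int i = 0; i < 4; i++)
+#           y[i] = sin(x[i]);
+#   }
+#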
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2^-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
+.Levensin_oddcos_tbl:
+ .quad .Lsinsin_sinsin_piby4 # 0
+ .quad .Lsinsin_sincos_piby4 # 1
+ .quad .Lsinsin_cossin_piby4 # 2
+ .quad .Lsinsin_coscos_piby4 # 3
+
+ .quad .Lsincos_sinsin_piby4 # 4
+ .quad .Lsincos_sincos_piby4 # 5
+ .quad .Lsincos_cossin_piby4 # 6
+ .quad .Lsincos_coscos_piby4 # 7
+
+ .quad .Lcossin_sinsin_piby4 # 8
+ .quad .Lcossin_sincos_piby4 # 9
+ .quad .Lcossin_cossin_piby4 # 10
+ .quad .Lcossin_coscos_piby4 # 11
+
+ .quad .Lcoscos_sinsin_piby4 # 12
+ .quad .Lcoscos_sincos_piby4 # 13
+ .quad .Lcoscos_cossin_piby4 # 14
+ .quad .Lcoscos_coscos_piby4 # 15
+
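+# This table is indexed later with a 4-bit value: bits 1:0 encode, for the
+# xmm0 pair, whether the lower and upper lane fall in an even (sin kernel)
+# or odd (cos kernel) quadrant, and bits 3:2 do the same for the xmm1 pair.
+# A C sketch of how one pair contributes its two bits (names illustrative):
+#
+#   /* npi2_lo/npi2_hi: quadrant counts for the two lanes of one pair */
+#   static int pair_bits(int npi2_lo, int npi2_hi) {
+#       return (npi2_lo & 1) | ((npi2_hi & 1) << 1);
+#   }
+#   /* index = pair_bits(pair0_lo, pair0_hi) | (pair_bits(pair1_lo, pair1_hi) << 2) */
+#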
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+
+.equ save_xmm6, 0x20 # save area for xmm6
+.equ save_xmm7, 0x30 # save area for xmm7
+.equ save_xmm8, 0x40 # save area for xmm8
+.equ save_xmm9, 0x50 # save area for xmm9
+.equ save_xmm10, 0x60 # save area for xmm10
+.equ save_xmm11, 0x70 # save area for xmm11
+.equ save_xmm12, 0x80 # save area for xmm12
+.equ save_xmm13, 0x90 # save area for xmm13
+.equ save_xmm14, 0x0A0 # save area for xmm14
+.equ save_xmm15, 0x0B0 # save area for xmm15
+
+.equ r, 0x0C0 # r for remainder_piby2
+.equ rr, 0x0D0 # rr for remainder_piby2
+.equ region, 0x0E0 # region for remainder_piby2
+
+.equ r1, 0x0F0 # r for remainder_piby2 (second pair)
+.equ rr1, 0x0100 # rr for remainder_piby2 (second pair)
+.equ region1, 0x0110 # region for remainder_piby2 (second pair)
+
+.equ p_temp2, 0x0120 # temporary for get/put bits operation
+.equ p_temp3, 0x0130 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0140 # temporary for get/put bits operation
+.equ p_temp5, 0x0150 # temporary for get/put bits operation
+
+.equ p_original, 0x0160 # original x
+.equ p_mask, 0x0170 # mask
+.equ p_sign, 0x0180 # sign
+
+.equ p_original1, 0x0190 # original x (second pair)
+.equ p_mask1, 0x01A0 # mask (second pair)
+.equ p_sign1, 0x01B0 # sign (second pair)
+
+.equ save_r12, 0x01C0 # save area for r12
+.equ save_r13, 0x01D0 # save area for r13
+
+.globl __vrd4_sin
+ .type __vrd4_sin,@function
+__vrd4_sin:
+
+ sub $0x1E8,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+#DEBUG
+# jmp .Lfinal_check
+#DEBUG
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+movdqa %xmm0,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm0,%rax #rax is lower arg
+movhpd %xmm0, p_temp+8(%rsp)
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp)
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm0,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
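+# The AND/OR/NOT sequence above is just an XOR: the final sign that gets
+# XOR-ed into the result is sign(x) ^ (bit 1 of npi2), evaluated per lane
+# (r12/r13 carry the input sign bits, r10/r11 carry the regions shifted
+# right by 1). A scalar C sketch of the same decision (signbit is from
+# <math.h>):
+#
+#   #include <math.h>
+#   #include <stdint.h>
+#   /* sign word to XOR into the packed result for one lane */
+#   static uint64_t sin_sign(double x, int npi2) {
+#       int flip = (signbit(x) ? 1 : 0) ^ ((npi2 >> 1) & 1);
+#       return flip ? 0x8000000000000000ULL : 0;      /* sign-bit pattern */
+#   }
+#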
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm0,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+#DEBUG
+# jmp .Lfinal_check
+#DEBUG
+
+ leaq .Levensin_oddcos_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
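+# The sequence above is the packed form of the usual two-piece (head/tail)
+# reduction by pi/2 for the fast path (all |x| < 5e5 here; larger inputs go
+# through __amd_remainder_piby2 instead). A scalar C sketch of the same
+# arithmetic, where twobypi, piby2_1, piby2_2 and piby2_2tail stand for the
+# .L__real_* constants in the data section above:
+#
+#   /* returns region = npi2; r + rr approximates x - npi2*pi/2 */
+#   static int reduce_piby2(double x, double *r, double *rr) {
+#       int    npi2  = (int)(x * twobypi + 0.5);
+#       double rhead = x - npi2 * piby2_1;            /* head of remainder */
+#       double rtail = npi2 * piby2_2;
+#       double t     = rhead;
+#       rhead = t - rtail;
+#       rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#       *r  = rhead - rtail;                          /* main part  */
+#       *rr = (rhead - *r) - rtail;                   /* correction */
+#       return npi2;      /* bits 0-1 pick the sin/cos kernel and the sign */
+#   }
+#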
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# rax, rcx, r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm10, xmm12
+# xmm9, xmm11, xmm13
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
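+# For a NaN/Inf lane no reduction is attempted: the quiet bit (bit 51) is
+# OR-ed into the raw bit pattern so a signalling NaN (or an infinity, for
+# which sin is undefined) comes back as a quiet NaN, and rr/region are
+# zeroed so the later polynomial path simply propagates that value. A C
+# sketch (bits_of/double_of are illustrative bit-cast helpers, not part of
+# this source):
+#
+#   #include <stdint.h>
+#   /* used when (bits_of(x) & 0x7ff0000000000000ULL) == 0x7ff0000000000000ULL */
+#   static double quiet_naninf(double x) {
+#       uint64_t ux = bits_of(x);                       /* raw IEEE-754 bits */
+#       return double_of(ux | 0x0008000000000000ULL);   /* set the quiet bit */
+#   }
+#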
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+# mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5:
+# mov p_original+8(%rsp),%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm10, xmm12
+# xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%eax # eax = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r(%rsp) # store lower r
+ movlpd %xmm6,rr(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf:
+# mov p_original+8(%rsp),%rcx ; upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13
+# xmm5, xmm8, xmm10, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_higher:
+# mov p_original1(%rsp),%r8 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+# mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher:
+# mov p_original1+8(%rsp),%r9 ;upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm1,%xmm1 ;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+
+ movlpd %xmm1,r1(%rsp) # store lower r
+ movlpd %xmm7,rr1(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_higher:
+# mov p_original1+8(%rsp),%r9 ; upper arg is nan/inf
+# mov r1+8(%rsp),%r9 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levensin_oddcos_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_cleanup:
+
+ movapd p_sign(%rsp),%xmm0
+ movapd p_sign1(%rsp),%xmm1
+ xorpd %xmm4,%xmm0 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+.Lfinal_check:
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x1E8,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+
+
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # recalculate t, -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # recalculate t, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
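+# The block above is the packed cos kernel for |r| <= pi/4 applied to both
+# pairs. In scalar form, with c1..c6 the .Lcosarray coefficients above and
+# (x, xx) the reduced argument r + rr, it computes (a sketch, associations
+# as in the code):
+#
+#   static double cos_piby4(double x, double xx) {
+#       double x2 = x * x;
+#       double zc = (c1 + x2*(c2 + x2*c3))
+#                 + x2*x2*x2*(c4 + x2*(c5 + x2*c6));
+#       double r  = 0.5 * x2;
+#       double t  = 1.0 - r;                 /* head of 1 - x*x/2 */
+#       return t + (((1.0 - t) - r) - x*xx + x2*x2*zc);
+#   }
+#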
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm5 = sin , xmm9 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 #1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 #1+(-t)
+ addsd p_temp(%rsp),%xmm4 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm0,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
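+# The block above evaluates, per pair, the sin kernel in one lane and the
+# cos kernel in the other. The scalar sin kernel for |r| <= pi/4, with
+# s1..s6 the .Lsinarray coefficients above and (x, xx) the reduced argument
+# r + rr, is (a sketch, associations as in the code):
+#
+#   static double sin_piby4(double x, double xx) {
+#       double x2 = x * x;
+#       double zs = (s1 + x2*(s2 + x2*s3))
+#                 + x2*x2*x2*(s4 + x2*(s5 + x2*s6));
+#       return x + (xx + (x2*x*zs - 0.5*x2*xx));
+#   }
+#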
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term
+
+ movapd .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos)
+
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos)
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin)
+ mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos)
+
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep low r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin)
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos)
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin)
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+
+ addsd p_temp(%rsp),%xmm4 # sin+xx +
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx +
+
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm0,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+ movapd %xmm1,p_temp3(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movhlps %xmm2,%xmm6 # move high x2 for x3 for sin term
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term
+ mulsd p_temp2+8(%rsp),%xmm6 # get high x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term
+ # Reverse 12 and 2
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm7,%xmm9 # sin *x3
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm11,%xmm9 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_sincos_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term +
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movhlps %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term +
+ mulsd p_temp2+8(%rsp),%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term +
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # store x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm11,p_temp3(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm0,%xmm2 # x3 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm6,%xmm12 # 0.5 * x2 *xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm12,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm6,%xmm4 # x3 * zs +xx
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+        subpd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculate, -t = r-1
+ addpd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm3,p_temp3(%rsp) # store x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm10,p_temp2(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm1,%xmm3 # x3 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm7,%xmm13 # 0.5 * x2 *xx
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+        subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm13,%xmm5 # -0.5 * x2 *xx
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm7,%xmm5 # +xx
+        subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculate, -t = r-1
+ addpd %xmm1,%xmm5 # +x
+ subpd %xmm12,%xmm4 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+ movhlps %xmm10,%xmm10 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ movsd %xmm0,%xmm8 # lower x for sin
+ mulsd %xmm2,%xmm8 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm8,%xmm2 # lower x3 for sin
+
+ movsd %xmm6,%xmm9 # lower xx
+ # note using odd reg
+
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx for upper cos term
+ mulpd %xmm1,%xmm7 # x * xx
+ movhlps %xmm6,%xmm6
+ mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+
+ subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm8 # + t
+ addsd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zszc
+ addpd %xmm9,%xmm5 # z
+
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+ mulpd %xmm3,%xmm3 # x4
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+ mulpd %xmm1,%xmm7 # x * xx
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+ mulpd %xmm3,%xmm5
+ # x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+ addsd %xmm0,%xmm8 # +x
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+ movhlps %xmm11,%xmm11 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zcs
+
+ movsd %xmm1,%xmm9 # lower x for sin
+ mulsd %xmm3,%xmm9 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm9,%xmm3 # lower x3 for sin
+
+ movsd %xmm7,%xmm8 # lower xx
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for upper cos term
+ movhlps %xmm7,%xmm7
+ mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+
+        subpd .L__real_3ff0000000000000(%rip),%xmm12 # t recalculate, -t = r-1
+        subsd .L__real_3ff0000000000000(%rip),%xmm13 # t recalculate, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm9 # + t
+ addsd %xmm1,%xmm5 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ movhlps %xmm11,%xmm11
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm3,%xmm12
+ mulsd %xmm1,%xmm12 # low x3 for sin
+
+ mulpd %xmm0, %xmm2 # x3
+ mulpd %xmm3, %xmm3 # high x4 for cos
+ movsd %xmm12,%xmm3 # low x3 for sin
+
+ movhlps %xmm1,%xmm8 # upper x for cos term
+ # note using even reg
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term
+
+ mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx
+
+ subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+
+ addsd %xmm1,%xmm5 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm9 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm5 # + t
+ addsd %xmm1, %xmm9 # +x
+
+ movlhps %xmm9, %xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ addsd %xmm1,%xmm9 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm5 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # x2
+ movapd %xmm6,p_temp(%rsp) # xx
+
+ movhlps %xmm10,%xmm10
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+
+ movsd %xmm2,%xmm13
+ mulsd %xmm0,%xmm13 # low x3 for sin
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm2,%xmm2 # high x4 for cos
+ movsd %xmm13,%xmm2 # low x3 for sin
+
+
+ movhlps %xmm0,%xmm9 # upper x for cos term ; note using even reg
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term
+ mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+ subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+ addsd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm8 # + t
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+        movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+
+ addsd %xmm0,%xmm8 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm4 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#DEBUG
+# xorpd %xmm0, %xmm0
+# xorpd %xmm1, %xmm1
+# jmp .Lfinal_check
+#DEBUG
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ movapd p_temp2(%rsp),%xmm10 # x2
+ movapd p_temp3(%rsp),%xmm11 # x2
+
+ mulpd %xmm0,%xmm10 # x3
+ mulpd %xmm1,%xmm11 # x3
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ subpd %xmm2,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # -0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # +xx
+ addpd %xmm7,%xmm5 # +xx
+
+ addpd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrd4_sin_cleanup
diff --git a/src/gas/vrda_scaled_logr.S b/src/gas/vrda_scaled_logr.S
new file mode 100644
index 0000000..9d1bdc1
--- /dev/null
+++ b/src/gas/vrda_scaled_logr.S
@@ -0,0 +1,2428 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrda_scaled_logr.s
+#
+# An array implementation of the log libm function.
+# Adapted to provide a scaling and shifting factor. This routine is
+# used by the ACML RNG distribution functions.
+#
+# Prototype:
+#
+# void vrda_scaled_logr(int n, double *x, double *y, double b);
+#
+# Computes the natural log of x multiplied by b.
+# A reduced precision routine. Uses the Intel novel reduction technique
+# with frcpa to compute logs.
+# Also uses only 3 polynomial terms to achieve 52-18 = 34 significant bits
+#
+# This specialized routine does not handle negative numbers, 0, NaNs, or infinity.
+# This routine is not C99 compliant
+# This version can compute logs in 26
+# cycles with n <= 24
+#
+#
+
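+# Illustrative scalar sketch (not part of the build) of the value computed
+# per array element below; frcpa(), N, k and lnf_table[] are stand-ins for
+# the hardware reciprocal approximation, the exponent, the table index and
+# .L__np_lnf_table:
+#
+#   double scaled_logr_element(double x, double b)
+#   {
+#       double c = frcpa(x);                  /* ~1/x, mantissa on a coarse grid */
+#       double r = x*c - 1.0;                 /* small reduced argument          */
+#       double p = r + r*r*(-0.5 + r/3.0);    /* 3-term series for log(1+r)      */
+#       double T = N*log(2.0) + lnf_table[k]; /* N, k recovered from bits of x   */
+#       return b*(T + p);                     /* y[i] = b * log(x[i])            */
+#   }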
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+
+.equ stack_size,0x0e8
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrda_scaled_logr__
+ .set vrda_scaled_logr__,__vrda_scaled_logr__
+ .weak vrda_scaled_logr_
+ .set vrda_scaled_logr_,__vrda_scaled_logr__
+
+# Fortran interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# rcx - double *b
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#x/* a FORTRAN subroutine implementation of array log
+#** VRDA_SCALED_LOGR(N,X,Y,B)
+# C equivalent*/
+#void vrda_scaled_logr__(int * n, double *x, double *y,double *b)
+#{
+# vrda_scaled_logr(*n,x,y,b);
+#}
+.globl __vrda_scaled_logr__
+ .type __vrda_scaled_logr__,@function
+__vrda_scaled_logr__:
+ mov (%rdi),%edi
+ movlpd (%rcx),%xmm0
+
+# C interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# xmm0 - double b
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_scaled_logr
+ .type vrda_scaled_logr,@function
+vrda_scaled_logr:
+ sub $stack_size,%rsp
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+# move the scale and shift factor to another register
+ movsd %xmm0,%xmm10
+ unpcklpd %xmm10,%xmm10
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 4 values (two __m128d) at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm1
+ movhpd -8(%rsi),%xmm1
+
+# compute the logs
+
+# movdqa %xmm0,p_x(%rsp) # save the input values
+
+# use the algorithm referenced in the Itanium transcendental paper.
+
+# reduction
+# compute r = x*frcpa(x) - 1
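+# (xmm8/xmm9 keep copies of the original x bits; they are used below to
+#  extract the exponent N and the mantissa-derived index into .L__np_lnf_table)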
+ movdqa %xmm0,%xmm8
+ movdqa %xmm1,%xmm9
+
+ call __vrd4_frcpa@PLT
+ movdqa %xmm8,%xmm4
+ movdqa %xmm9,%xmm7
+# invert the exponent
+ psllq $1,%xmm8
+ psllq $1,%xmm9
+ mulpd %xmm0,%xmm4 # r
+ mulpd %xmm1,%xmm7 # r
+ movdqa %xmm8,%xmm5
+ paddq .L__mask_rup(%rip),%xmm8
+ psrlq $53,%xmm8
+ movdqa %xmm9,%xmm6
+ paddq .L__mask_rup(%rip),%xmm6
+ psrlq $53,%xmm6
+ psubq .L__mask_3ff(%rip),%xmm8
+ psubq .L__mask_3ff(%rip),%xmm6
+ pshufd $0x058,%xmm8,%xmm8
+ pshufd $0x058,%xmm6,%xmm6
+
+
+ subpd .L__real_one(%rip),%xmm4
+ subpd .L__real_one(%rip),%xmm7
+
+ cvtdq2pd %xmm8,%xmm0 #N
+ cvtdq2pd %xmm6,%xmm1 #N
+# movdqa %xmm8,%xmm0
+# movdqa %xmm6,%xmm1
+# compute index for table lookup. if 1/2 bit set, increment the index+exponent
+ psrlq $42,%xmm5
+ psrlq $42,%xmm9
+ paddq .L__int_one(%rip),%xmm5
+ paddq .L__int_one(%rip),%xmm9
+ psrlq $1,%xmm5
+ psrlq $1,%xmm9
+ pand .L__mask_3ff(%rip),%xmm5
+ pand .L__mask_3ff(%rip),%xmm9
+ psllq $1,%xmm5
+ psllq $1,%xmm9
+
+ movdqa %xmm5,p_x(%rsp) # move the indexes to a memory location
+ movdqa %xmm9,p_x2(%rsp)
+
+
+ movapd .L__real_third(%rip),%xmm3
+ movdqa %xmm3,%xmm5
+ movapd %xmm4,%xmm2
+ movapd %xmm7,%xmm8
+
+# approximation
+# compute the polynomial
+# p(r) = p2*r^2 + p3*r^3, with p2 = -1/2 and p3 = 1/3
+# (evaluated below as r^2*(p2 + p3*r); the leading r term is added separately)
+
+ mulpd %xmm4,%xmm2 #r^2
+ mulpd %xmm7,%xmm8 #r^2
+
+ mulpd %xmm4,%xmm3 # 1/3r
+ mulpd %xmm7,%xmm5 # 1/3r
+# lookup the f(k) term
+ lea .L__np_lnf_table(%rip),%rdx
+ mov p_x(%rsp),%rcx
+ mov p_x+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm6 # lookup
+ movhpd (%rdx,%r9,8),%xmm6 # lookup
+
+ addpd .L__real_half(%rip),%xmm3 # p2 + p3r
+ addpd .L__real_half(%rip),%xmm5 # p2 + p3r
+
+ mov p_x2(%rsp),%rcx
+ mov p_x2+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm9 # lookup
+ movhpd (%rdx,%r9,8),%xmm9 # lookup
+
+ mulpd %xmm3,%xmm2 # r2(p2 + p3r)
+ mulpd %xmm5,%xmm8 # r2(p2 + p3r)
+ addpd %xmm4,%xmm2 # +r
+ addpd %xmm7,%xmm8 # +r
+
+
+# reconstruction
+# compute ln(x) = T + r + p(r) where
+# T = N*ln(2)+ln(1/frcpa(x)) via table of ln(1/frcpa(y)), where y = 1 + k/256, 0<=k<=255
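+# (at this point xmm0/xmm1 hold N, xmm6/xmm9 the table values, and
+#  xmm2/xmm8 the series r + p(r); the scale factor b is applied at the store)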
+
+ mulpd .L__real_log2(%rip),%xmm0 # compute N*__real_log2
+ mulpd .L__real_log2(%rip),%xmm1 # compute N*__real_log2
+ addpd %xmm6,%xmm2 # add the new mantissas
+ addpd %xmm9,%xmm8 # add the new mantissas
+ addpd %xmm2,%xmm0
+ addpd %xmm8,%xmm1
+
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ mulpd %xmm10,%xmm0
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ mulpd %xmm10,%xmm1
+ movlpd %xmm1,-16(%rdi)
+ movhpd %xmm1,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+
+
+
+# we jump here when we have an odd number of log calls to make at the
+# end
+# we assume that rdx is pointing at the next x array element,
+# r8 at the next y array element. The number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ movsd %xmm10,%xmm0
+ call vrda_scaled_logr@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+
+.L__real_half: .quad 0x0bfe0000000000000 # -1/2
+ .quad 0x0bfe0000000000000
+.L__real_third: .quad 0x03fd5555555555555 # 1/3
+ .quad 0x03fd5555555555555
+.L__real_fourth: .quad 0x0bfd0000000000000 # -1/4
+ .quad 0x0bfd0000000000000
+
+.L__real_log2: .quad 0x03FE62E42FEFA39EF # ln(2) = 0.6931471805599453
+ .quad 0x03FE62E42FEFA39EF
+
+.L__mask_3ff: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+
+.L__mask_rup: .quad 0x0000003fffffffffe
+ .quad 0x0000003fffffffffe
+
+.L__int_one: .quad 0x00000000000000001
+ .quad 0x00000000000000001
+
+
+
+
+.L__np_lnf_table:
+#log table Program - logtab.c
+#Built Jan 18 2006 09:51:57
+#Compiler version 1400
+
+ .quad 0x00000000000000000 # 0.000000000000 0
+ .quad 0x00000000000000000
+ .quad 0x03F50020055655885 # 0.000977039648 1
+ .quad 0x03F50020055655885
+ .quad 0x03F60040155D5881E # 0.001955034836 2
+ .quad 0x03F60040155D5881E
+ .quad 0x03F6809048289860A # 0.002933987435 3
+ .quad 0x03F6809048289860A
+ .quad 0x03F70080559588B25 # 0.003913899321 4
+ .quad 0x03F70080559588B25
+ .quad 0x03F740C8A7478788D # 0.004894772377 5
+ .quad 0x03F740C8A7478788D
+ .quad 0x03F78121214586B02 # 0.005876608489 6
+ .quad 0x03F78121214586B02
+ .quad 0x03F7C189CBB0E283F # 0.006859409551 7
+ .quad 0x03F7C189CBB0E283F
+ .quad 0x03F8010157588DE69 # 0.007843177461 8
+ .quad 0x03F8010157588DE69
+ .quad 0x03F82145E939EF1BC # 0.008827914124 9
+ .quad 0x03F82145E939EF1BC
+ .quad 0x03F83D8896A83D7A8 # 0.009690354884 10
+ .quad 0x03F83D8896A83D7A8
+ .quad 0x03F85DDC705054DFF # 0.010676913110 11
+ .quad 0x03F85DDC705054DFF
+ .quad 0x03F87E38762CA0C6D # 0.011664445593 12
+ .quad 0x03F87E38762CA0C6D
+ .quad 0x03F89E9CAC6007563 # 0.012652954261 13
+ .quad 0x03F89E9CAC6007563
+ .quad 0x03F8BF091710935A4 # 0.013642441046 14
+ .quad 0x03F8BF091710935A4
+ .quad 0x03F8DF7DBA6777895 # 0.014632907884 15
+ .quad 0x03F8DF7DBA6777895
+ .quad 0x03F8FBEA8B13C03F9 # 0.015500371846 16
+ .quad 0x03F8FBEA8B13C03F9
+ .quad 0x03F90E3751F24F45C # 0.016492681528 17
+ .quad 0x03F90E3751F24F45C
+ .quad 0x03F91E7D80B1FBF4C # 0.017485976867 18
+ .quad 0x03F91E7D80B1FBF4C
+ .quad 0x03F92CBE4F6CC56C3 # 0.018355920375 19
+ .quad 0x03F92CBE4F6CC56C3
+ .quad 0x03F93D0C443D7258C # 0.019351069108 20
+ .quad 0x03F93D0C443D7258C
+ .quad 0x03F94D5E6176ACC89 # 0.020347209148 21
+ .quad 0x03F94D5E6176ACC89
+ .quad 0x03F95DB4A937DEF10 # 0.021344342472 22
+ .quad 0x03F95DB4A937DEF10
+ .quad 0x03F96C039490E37F4 # 0.022217650494 23
+ .quad 0x03F96C039490E37F4
+ .quad 0x03F97C61B1CF5DED7 # 0.023216651576 24
+ .quad 0x03F97C61B1CF5DED7
+ .quad 0x03F98AB77B3FD6EAD # 0.024091596947 25
+ .quad 0x03F98AB77B3FD6EAD
+ .quad 0x03F99B1D75828E780 # 0.025092472797 26
+ .quad 0x03F99B1D75828E780
+ .quad 0x03F9AB87A478CB7CB # 0.026094351403 27
+ .quad 0x03F9AB87A478CB7CB
+ .quad 0x03F9B9E8027E1916F # 0.026971819338 28
+ .quad 0x03F9B9E8027E1916F
+ .quad 0x03F9CA5A1A18613E6 # 0.027975583538 29
+ .quad 0x03F9CA5A1A18613E6
+ .quad 0x03F9D8C1670325921 # 0.028854704473 30
+ .quad 0x03F9D8C1670325921
+ .quad 0x03F9E93B6EE41F674 # 0.029860361378 31
+ .quad 0x03F9E93B6EE41F674
+ .quad 0x03F9F7A9B16782855 # 0.030741141554 32
+ .quad 0x03F9F7A9B16782855
+ .quad 0x03FA0415D89E74440 # 0.031748698315 33
+ .quad 0x03FA0415D89E74440
+ .quad 0x03FA0C58FA19DFAAB # 0.032757271269 34
+ .quad 0x03FA0C58FA19DFAAB
+ .quad 0x03FA139577CC41C1A # 0.033640607815 35
+ .quad 0x03FA139577CC41C1A
+ .quad 0x03FA1AD398C6CD57C # 0.034524725334 36
+ .quad 0x03FA1AD398C6CD57C
+ .quad 0x03FA231C9C40E204E # 0.035536103423 37
+ .quad 0x03FA231C9C40E204E
+ .quad 0x03FA2A5E4231CF7BD # 0.036421899115 38
+ .quad 0x03FA2A5E4231CF7BD
+ .quad 0x03FA32AB4D4C59CB0 # 0.037435198758 39
+ .quad 0x03FA32AB4D4C59CB0
+ .quad 0x03FA39F07BA0EBD5A # 0.038322679007 40
+ .quad 0x03FA39F07BA0EBD5A
+ .quad 0x03FA424192495D571 # 0.039337907520 41
+ .quad 0x03FA424192495D571
+ .quad 0x03FA498A4C73DA65D # 0.040227078744 42
+ .quad 0x03FA498A4C73DA65D
+ .quad 0x03FA50D4AF75CA86F # 0.041117041297 43
+ .quad 0x03FA50D4AF75CA86F
+ .quad 0x03FA592BBC15215BC # 0.042135112141 44
+ .quad 0x03FA592BBC15215BC
+ .quad 0x03FA6079B00423FF6 # 0.043026775152 45
+ .quad 0x03FA6079B00423FF6
+ .quad 0x03FA67C94F2D4BB65 # 0.043919233935 46
+ .quad 0x03FA67C94F2D4BB65
+ .quad 0x03FA70265A550E77B # 0.044940163069 47
+ .quad 0x03FA70265A550E77B
+ .quad 0x03FA77798F8D6DFDC # 0.045834331871 48
+ .quad 0x03FA77798F8D6DFDC
+ .quad 0x03FA7ECE7267CD123 # 0.046729300926 49
+ .quad 0x03FA7ECE7267CD123
+ .quad 0x03FA873184BC09586 # 0.047753104446 50
+ .quad 0x03FA873184BC09586
+ .quad 0x03FA8E8A02D2E3175 # 0.048649793163 51
+ .quad 0x03FA8E8A02D2E3175
+ .quad 0x03FA95E430F8CE456 # 0.049547286652 52
+ .quad 0x03FA95E430F8CE456
+ .quad 0x03FA9D400FF482586 # 0.050445586359 53
+ .quad 0x03FA9D400FF482586
+ .quad 0x03FAA5AB21CB34A9E # 0.051473203662 54
+ .quad 0x03FAA5AB21CB34A9E
+ .quad 0x03FAAD0AA2E784EF4 # 0.052373235867 55
+ .quad 0x03FAAD0AA2E784EF4
+ .quad 0x03FAB46BD74DA76A0 # 0.053274078860 56
+ .quad 0x03FAB46BD74DA76A0
+ .quad 0x03FABBCEBFC68F424 # 0.054175734102 57
+ .quad 0x03FABBCEBFC68F424
+ .quad 0x03FAC3335D1BBAE4D # 0.055078203060 58
+ .quad 0x03FAC3335D1BBAE4D
+ .quad 0x03FACBA87200EB8F1 # 0.056110594428 59
+ .quad 0x03FACBA87200EB8F1
+ .quad 0x03FAD310BA20455A2 # 0.057014812019 60
+ .quad 0x03FAD310BA20455A2
+ .quad 0x03FADA7AB998B77ED # 0.057919847959 61
+ .quad 0x03FADA7AB998B77ED
+ .quad 0x03FAE1E6713606CFB # 0.058825703731 62
+ .quad 0x03FAE1E6713606CFB
+ .quad 0x03FAE953E1C48603A # 0.059732380822 63
+ .quad 0x03FAE953E1C48603A
+ .quad 0x03FAF0C30C1116351 # 0.060639880722 64
+ .quad 0x03FAF0C30C1116351
+ .quad 0x03FAF833F0E927711 # 0.061548204926 65
+ .quad 0x03FAF833F0E927711
+ .quad 0x03FAFFA6911AB9309 # 0.062457354934 66
+ .quad 0x03FAFFA6911AB9309
+ .quad 0x03FB038D76BA2D737 # 0.063367332247 67
+ .quad 0x03FB038D76BA2D737
+ .quad 0x03FB0748836296412 # 0.064278138373 68
+ .quad 0x03FB0748836296412
+ .quad 0x03FB0B046EEE6F7A4 # 0.065189774824 69
+ .quad 0x03FB0B046EEE6F7A4
+ .quad 0x03FB0EC139C5DA5FD # 0.066102243114 70
+ .quad 0x03FB0EC139C5DA5FD
+ .quad 0x03FB127EE451413A8 # 0.067015544762 71
+ .quad 0x03FB127EE451413A8
+ .quad 0x03FB163D6EF9579FC # 0.067929681294 72
+ .quad 0x03FB163D6EF9579FC
+ .quad 0x03FB19FCDA271ABC0 # 0.068844654235 73
+ .quad 0x03FB19FCDA271ABC0
+ .quad 0x03FB1DBD2643D1912 # 0.069760465119 74
+ .quad 0x03FB1DBD2643D1912
+ .quad 0x03FB217E53B90D3CE # 0.070677115481 75
+ .quad 0x03FB217E53B90D3CE
+ .quad 0x03FB254062F0A9417 # 0.071594606862 76
+ .quad 0x03FB254062F0A9417
+ .quad 0x03FB29035454CBCB0 # 0.072512940806 77
+ .quad 0x03FB29035454CBCB0
+ .quad 0x03FB2CC7284FE5F1A # 0.073432118863 78
+ .quad 0x03FB2CC7284FE5F1A
+ .quad 0x03FB308BDF4CB4062 # 0.074352142586 79
+ .quad 0x03FB308BDF4CB4062
+ .quad 0x03FB345179B63DD3F # 0.075273013532 80
+ .quad 0x03FB345179B63DD3F
+ .quad 0x03FB3817F7F7D6EAB # 0.076194733263 81
+ .quad 0x03FB3817F7F7D6EAB
+ .quad 0x03FB3BDF5A7D1EE5E # 0.077117303344 82
+ .quad 0x03FB3BDF5A7D1EE5E
+ .quad 0x03FB3F1D405CE86D3 # 0.077908755701 83
+ .quad 0x03FB3F1D405CE86D3
+ .quad 0x03FB42E64BEC266E4 # 0.078832909176 84
+ .quad 0x03FB42E64BEC266E4
+ .quad 0x03FB46B03CF437BC4 # 0.079757917501 85
+ .quad 0x03FB46B03CF437BC4
+ .quad 0x03FB4A7B13E1E3E65 # 0.080683782259 86
+ .quad 0x03FB4A7B13E1E3E65
+ .quad 0x03FB4E46D1223FE84 # 0.081610505036 87
+ .quad 0x03FB4E46D1223FE84
+ .quad 0x03FB52137522AE732 # 0.082538087426 88
+ .quad 0x03FB52137522AE732
+ .quad 0x03FB5555DE434F2A0 # 0.083333843436 89
+ .quad 0x03FB5555DE434F2A0
+ .quad 0x03FB59242FF043D34 # 0.084263026485 90
+ .quad 0x03FB59242FF043D34
+ .quad 0x03FB5CF36997817B2 # 0.085193073719 91
+ .quad 0x03FB5CF36997817B2
+ .quad 0x03FB60C38BA799459 # 0.086123986746 92
+ .quad 0x03FB60C38BA799459
+ .quad 0x03FB6408F471C82A2 # 0.086922602521 93
+ .quad 0x03FB6408F471C82A2
+ .quad 0x03FB67DAC7466CB96 # 0.087855127734 94
+ .quad 0x03FB67DAC7466CB96
+ .quad 0x03FB6BAD83C1883BA # 0.088788523361 95
+ .quad 0x03FB6BAD83C1883BA
+ .quad 0x03FB6EF528C056A2D # 0.089589270768 96
+ .quad 0x03FB6EF528C056A2D
+ .quad 0x03FB72C9985035BB1 # 0.090524287199 97
+ .quad 0x03FB72C9985035BB1
+ .quad 0x03FB769EF2C6B5688 # 0.091460178704 98
+ .quad 0x03FB769EF2C6B5688
+ .quad 0x03FB79E8D70A364C6 # 0.092263069152 99
+ .quad 0x03FB79E8D70A364C6
+ .quad 0x03FB7DBFE6EA733FE # 0.093200590148 100
+ .quad 0x03FB7DBFE6EA733FE
+ .quad 0x03FB8197E2F40E3F0 # 0.094138990914 101
+ .quad 0x03FB8197E2F40E3F0
+ .quad 0x03FB84E40992A4804 # 0.094944035906 102
+ .quad 0x03FB84E40992A4804
+ .quad 0x03FB88BDBD5FC66D2 # 0.095884074919 103
+ .quad 0x03FB88BDBD5FC66D2
+ .quad 0x03FB8C985E9B9EC7E # 0.096824998438 104
+ .quad 0x03FB8C985E9B9EC7E
+ .quad 0x03FB8FE6CAB20E979 # 0.097632209567 105
+ .quad 0x03FB8FE6CAB20E979
+ .quad 0x03FB93C3261014C65 # 0.098574780162 106
+ .quad 0x03FB93C3261014C65
+ .quad 0x03FB97130DC9235DE # 0.099383405543 107
+ .quad 0x03FB97130DC9235DE
+ .quad 0x03FB9AF124D64C623 # 0.100327628989 108
+ .quad 0x03FB9AF124D64C623
+ .quad 0x03FB9E4289871E964 # 0.101137673586 109
+ .quad 0x03FB9E4289871E964
+ .quad 0x03FBA2225DD276FCB # 0.102083555691 110
+ .quad 0x03FBA2225DD276FCB
+ .quad 0x03FBA57540D1FE441 # 0.102895024494 111
+ .quad 0x03FBA57540D1FE441
+ .quad 0x03FBA956D3ECADE60 # 0.103842571097 112
+ .quad 0x03FBA956D3ECADE60
+ .quad 0x03FBACAB3693AB9C0 # 0.104655469123 113
+ .quad 0x03FBACAB3693AB9C0
+ .quad 0x03FBB08E8A10F96F4 # 0.105604686090 114
+ .quad 0x03FBB08E8A10F96F4
+ .quad 0x03FBB3E46DBA02181 # 0.106419018383 115
+ .quad 0x03FBB3E46DBA02181
+ .quad 0x03FBB7C9832F58018 # 0.107369911615 116
+ .quad 0x03FBB7C9832F58018
+ .quad 0x03FBBB20E936D6976 # 0.108185683244 117
+ .quad 0x03FBBB20E936D6976
+ .quad 0x03FBBF07C23BC54EA # 0.109138258671 118
+ .quad 0x03FBBF07C23BC54EA
+ .quad 0x03FBC260ABFFFE972 # 0.109955474734 119
+ .quad 0x03FBC260ABFFFE972
+ .quad 0x03FBC6494A2E418A0 # 0.110909738320 120
+ .quad 0x03FBC6494A2E418A0
+ .quad 0x03FBC9A3B90F57748 # 0.111728403941 121
+ .quad 0x03FBC9A3B90F57748
+ .quad 0x03FBCCFEDBFEE13A8 # 0.112547740324 122
+ .quad 0x03FBCCFEDBFEE13A8
+ .quad 0x03FBD0EA1362CDBFC # 0.113504482008 123
+ .quad 0x03FBD0EA1362CDBFC
+ .quad 0x03FBD446BD753D433 # 0.114325275488 124
+ .quad 0x03FBD446BD753D433
+ .quad 0x03FBD7A41C8627307 # 0.115146743223 125
+ .quad 0x03FBD7A41C8627307
+ .quad 0x03FBDB91F09680DF9 # 0.116105975911 126
+ .quad 0x03FBDB91F09680DF9
+ .quad 0x03FBDEF0D8D466DBB # 0.116928908339 127
+ .quad 0x03FBDEF0D8D466DBB
+ .quad 0x03FBE2507702AF03B # 0.117752518544 128
+ .quad 0x03FBE2507702AF03B
+ .quad 0x03FBE640EB3D2B411 # 0.118714255240 129
+ .quad 0x03FBE640EB3D2B411
+ .quad 0x03FBE9A214A69DD58 # 0.119539337795 130
+ .quad 0x03FBE9A214A69DD58
+ .quad 0x03FBED03F4F440969 # 0.120365101673 131
+ .quad 0x03FBED03F4F440969
+ .quad 0x03FBF0F70CDD992E4 # 0.121329355484 132
+ .quad 0x03FBF0F70CDD992E4
+ .quad 0x03FBF45A7A78B7C3B # 0.122156599431 133
+ .quad 0x03FBF45A7A78B7C3B
+ .quad 0x03FBF7BE9FEDBFDED # 0.122984528276 134
+ .quad 0x03FBF7BE9FEDBFDED
+ .quad 0x03FBFB237D8AB13FB # 0.123813143156 135
+ .quad 0x03FBFB237D8AB13FB
+ .quad 0x03FBFF1A13EAC95FD # 0.124780729104 136
+ .quad 0x03FBFF1A13EAC95FD
+ .quad 0x03FC014040CAB0229 # 0.125610834299 137
+ .quad 0x03FC014040CAB0229
+ .quad 0x03FC02F3D4301417B # 0.126441629140 138
+ .quad 0x03FC02F3D4301417B
+ .quad 0x03FC04A7C44CF87A4 # 0.127273114776 139
+ .quad 0x03FC04A7C44CF87A4
+ .quad 0x03FC06A4D1D26C5E9 # 0.128244055971 140
+ .quad 0x03FC06A4D1D26C5E9
+ .quad 0x03FC08598B59E3A07 # 0.129077042275 141
+ .quad 0x03FC08598B59E3A07
+ .quad 0x03FC0A0EA2164AF02 # 0.129910723024 142
+ .quad 0x03FC0A0EA2164AF02
+ .quad 0x03FC0BC4162F73B66 # 0.130745099376 143
+ .quad 0x03FC0BC4162F73B66
+ .quad 0x03FC0D79E7CD48E58 # 0.131580172493 144
+ .quad 0x03FC0D79E7CD48E58
+ .quad 0x03FC0F301717CF0FB # 0.132415943541 145
+ .quad 0x03FC0F301717CF0FB
+ .quad 0x03FC10E6A437247B7 # 0.133252413686 146
+ .quad 0x03FC10E6A437247B7
+ .quad 0x03FC12E6BFA8FEAD6 # 0.134229180665 147
+ .quad 0x03FC12E6BFA8FEAD6
+ .quad 0x03FC149E189F8642E # 0.135067169541 148
+ .quad 0x03FC149E189F8642E
+ .quad 0x03FC1655CFEA923A4 # 0.135905861231 149
+ .quad 0x03FC1655CFEA923A4
+ .quad 0x03FC180DE5B2ACE5C # 0.136745256915 150
+ .quad 0x03FC180DE5B2ACE5C
+ .quad 0x03FC19C65A207AC07 # 0.137585357777 151
+ .quad 0x03FC19C65A207AC07
+ .quad 0x03FC1B7F2D5CBA842 # 0.138426165001 152
+ .quad 0x03FC1B7F2D5CBA842
+ .quad 0x03FC1D385F90453F2 # 0.139267679777 153
+ .quad 0x03FC1D385F90453F2
+ .quad 0x03FC1EF1F0E40E6CD # 0.140109903297 154
+ .quad 0x03FC1EF1F0E40E6CD
+ .quad 0x03FC20ABE18124098 # 0.140952836755 155
+ .quad 0x03FC20ABE18124098
+ .quad 0x03FC22663190AEACC # 0.141796481350 156
+ .quad 0x03FC22663190AEACC
+ .quad 0x03FC2420E13BF19E3 # 0.142640838281 157
+ .quad 0x03FC2420E13BF19E3
+ .quad 0x03FC25DBF0AC4AED2 # 0.143485908754 158
+ .quad 0x03FC25DBF0AC4AED2
+ .quad 0x03FC2797600B3387B # 0.144331693975 159
+ .quad 0x03FC2797600B3387B
+ .quad 0x03FC29532F823F525 # 0.145178195155 160
+ .quad 0x03FC29532F823F525
+ .quad 0x03FC2B0F5F3B1D3EF # 0.146025413505 161
+ .quad 0x03FC2B0F5F3B1D3EF
+ .quad 0x03FC2CCBEF5F97653 # 0.146873350243 162
+ .quad 0x03FC2CCBEF5F97653
+ .quad 0x03FC2E88E01993187 # 0.147722006588 163
+ .quad 0x03FC2E88E01993187
+ .quad 0x03FC3046319311009 # 0.148571383763 164
+ .quad 0x03FC3046319311009
+ .quad 0x03FC3203E3F62D328 # 0.149421482992 165
+ .quad 0x03FC3203E3F62D328
+ .quad 0x03FC33C1F76D1F469 # 0.150272305505 166
+ .quad 0x03FC33C1F76D1F469
+ .quad 0x03FC35806C223A70F # 0.151123852534 167
+ .quad 0x03FC35806C223A70F
+ .quad 0x03FC373F423FED9A1 # 0.151976125313 168
+ .quad 0x03FC373F423FED9A1
+ .quad 0x03FC38FE79F0C3771 # 0.152829125080 169
+ .quad 0x03FC38FE79F0C3771
+ .quad 0x03FC3ABE135F62A12 # 0.153682853077 170
+ .quad 0x03FC3ABE135F62A12
+ .quad 0x03FC3C335E0447D71 # 0.154394850259 171
+ .quad 0x03FC3C335E0447D71
+ .quad 0x03FC3DF3AB13505F9 # 0.155249916579 172
+ .quad 0x03FC3DF3AB13505F9
+ .quad 0x03FC3FB45A59928CA # 0.156105714663 173
+ .quad 0x03FC3FB45A59928CA
+ .quad 0x03FC41756C0220C81 # 0.156962245765 174
+ .quad 0x03FC41756C0220C81
+ .quad 0x03FC4336E03829D61 # 0.157819511141 175
+ .quad 0x03FC4336E03829D61
+ .quad 0x03FC44F8B726F8EFE # 0.158677512051 176
+ .quad 0x03FC44F8B726F8EFE
+ .quad 0x03FC46BAF0F9F5DB8 # 0.159536249760 177
+ .quad 0x03FC46BAF0F9F5DB8
+ .quad 0x03FC48326CD3EC797 # 0.160252428262 178
+ .quad 0x03FC48326CD3EC797
+ .quad 0x03FC49F55C6502F81 # 0.161112520058 179
+ .quad 0x03FC49F55C6502F81
+ .quad 0x03FC4BB8AF55DE908 # 0.161973352249 180
+ .quad 0x03FC4BB8AF55DE908
+ .quad 0x03FC4D7C65D25566D # 0.162834926111 181
+ .quad 0x03FC4D7C65D25566D
+ .quad 0x03FC4F4080065AA7F # 0.163697242922 182
+ .quad 0x03FC4F4080065AA7F
+ .quad 0x03FC50B98CD30A759 # 0.164416408720 183
+ .quad 0x03FC50B98CD30A759
+ .quad 0x03FC527E5E4A1B58D # 0.165280090939 184
+ .quad 0x03FC527E5E4A1B58D
+ .quad 0x03FC544393F5DF80F # 0.166144519750 185
+ .quad 0x03FC544393F5DF80F
+ .quad 0x03FC56092E02BA514 # 0.167009696444 186
+ .quad 0x03FC56092E02BA514
+ .quad 0x03FC57837B3098F2C # 0.167731249257 187
+ .quad 0x03FC57837B3098F2C
+ .quad 0x03FC5949CDB873419 # 0.168597800437 188
+ .quad 0x03FC5949CDB873419
+ .quad 0x03FC5B10851FC924A # 0.169465103180 189
+ .quad 0x03FC5B10851FC924A
+ .quad 0x03FC5C8BC079D8289 # 0.170188430518 190
+ .quad 0x03FC5C8BC079D8289
+ .quad 0x03FC5E533144C1718 # 0.171057114516 191
+ .quad 0x03FC5E533144C1718
+ .quad 0x03FC601B076E7A8A8 # 0.171926553783 192
+ .quad 0x03FC601B076E7A8A8
+ .quad 0x03FC619732215D786 # 0.172651664394 193
+ .quad 0x03FC619732215D786
+ .quad 0x03FC635FC298F6C77 # 0.173522491735 194
+ .quad 0x03FC635FC298F6C77
+ .quad 0x03FC6528B8EFA5D16 # 0.174394078077 195
+ .quad 0x03FC6528B8EFA5D16
+ .quad 0x03FC66A5D42A3AD33 # 0.175120980777 196
+ .quad 0x03FC66A5D42A3AD33
+ .quad 0x03FC686F85BAD4298 # 0.175993962063 197
+ .quad 0x03FC686F85BAD4298
+ .quad 0x03FC6A399DABBD383 # 0.176867706111 198
+ .quad 0x03FC6A399DABBD383
+ .quad 0x03FC6BB7AA9F22C40 # 0.177596409780 199
+ .quad 0x03FC6BB7AA9F22C40
+ .quad 0x03FC6D827EB7C1E57 # 0.178471555693 200
+ .quad 0x03FC6D827EB7C1E57
+ .quad 0x03FC6F0128B756AB9 # 0.179201429458 201
+ .quad 0x03FC6F0128B756AB9
+ .quad 0x03FC70CCB9927BCF6 # 0.180077981742 202
+ .quad 0x03FC70CCB9927BCF6
+ .quad 0x03FC7298B1A4E32B6 # 0.180955303044 203
+ .quad 0x03FC7298B1A4E32B6
+ .quad 0x03FC74184F58CC7DC # 0.181686992547 204
+ .quad 0x03FC74184F58CC7DC
+ .quad 0x03FC75E5051E74141 # 0.182565727226 205
+ .quad 0x03FC75E5051E74141
+ .quad 0x03FC77654128F6127 # 0.183298596442 206
+ .quad 0x03FC77654128F6127
+ .quad 0x03FC7932B53E97639 # 0.184178749058 207
+ .quad 0x03FC7932B53E97639
+ .quad 0x03FC7AB390229D8FD # 0.184912801796 208
+ .quad 0x03FC7AB390229D8FD
+ .quad 0x03FC7C81C325B4A5E # 0.185794376934 209
+ .quad 0x03FC7C81C325B4A5E
+ .quad 0x03FC7E033D66CD24A # 0.186529617023 210
+ .quad 0x03FC7E033D66CD24A
+ .quad 0x03FC7FD22FF599D4C # 0.187412619288 211
+ .quad 0x03FC7FD22FF599D4C
+ .quad 0x03FC81544A17F67C1 # 0.188149050576 212
+ .quad 0x03FC81544A17F67C1
+ .quad 0x03FC8323FCD17DAC8 # 0.189033484595 213
+ .quad 0x03FC8323FCD17DAC8
+ .quad 0x03FC84A6B759F512D # 0.189771110947 214
+ .quad 0x03FC84A6B759F512D
+ .quad 0x03FC86772ADE0201C # 0.190656981373 215
+ .quad 0x03FC86772ADE0201C
+ .quad 0x03FC87FA865210911 # 0.191395806674 216
+ .quad 0x03FC87FA865210911
+ .quad 0x03FC89CBBB4136201 # 0.192283118179 217
+ .quad 0x03FC89CBBB4136201
+ .quad 0x03FC8B4FB826FF291 # 0.193023146334 218
+ .quad 0x03FC8B4FB826FF291
+ .quad 0x03FC8D21AF2299298 # 0.193911903613 219
+ .quad 0x03FC8D21AF2299298
+ .quad 0x03FC8EA64E00E7FC0 # 0.194653138545 220
+ .quad 0x03FC8EA64E00E7FC0
+ .quad 0x03FC902B36AB7681D # 0.195394923313 221
+ .quad 0x03FC902B36AB7681D
+ .quad 0x03FC91FE49096581E # 0.196285791969 222
+ .quad 0x03FC91FE49096581E
+ .quad 0x03FC9383D471B869B # 0.197028789254 223
+ .quad 0x03FC9383D471B869B
+ .quad 0x03FC9557AA6B87F65 # 0.197921115309 224
+ .quad 0x03FC9557AA6B87F65
+ .quad 0x03FC96DDD91A0B959 # 0.198665329082 225
+ .quad 0x03FC96DDD91A0B959
+ .quad 0x03FC9864522D04491 # 0.199410097121 226
+ .quad 0x03FC9864522D04491
+ .quad 0x03FC9A3945D1A44B3 # 0.200304551564 227
+ .quad 0x03FC9A3945D1A44B3
+ .quad 0x03FC9BC062F26FC3B # 0.201050541900 228
+ .quad 0x03FC9BC062F26FC3B
+ .quad 0x03FC9D47CAD2C1871 # 0.201797089154 229
+ .quad 0x03FC9D47CAD2C1871
+ .quad 0x03FC9F1DDD7FE4F8B # 0.202693682161 230
+ .quad 0x03FC9F1DDD7FE4F8B
+ .quad 0x03FCA0A5EA371A910 # 0.203441457564 231
+ .quad 0x03FCA0A5EA371A910
+ .quad 0x03FCA22E42098F498 # 0.204189792554 232
+ .quad 0x03FCA22E42098F498
+ .quad 0x03FCA405751F6CCE4 # 0.205088534376 233
+ .quad 0x03FCA405751F6CCE4
+ .quad 0x03FCA58E729348F40 # 0.205838103409 234
+ .quad 0x03FCA58E729348F40
+ .quad 0x03FCA717BB7EC64A3 # 0.206588234717 235
+ .quad 0x03FCA717BB7EC64A3
+ .quad 0x03FCA8F010601E5FD # 0.207489135679 236
+ .quad 0x03FCA8F010601E5FD
+ .quad 0x03FCAA79FFB8FCD48 # 0.208240506966 237
+ .quad 0x03FCAA79FFB8FCD48
+ .quad 0x03FCAC043AE68965A # 0.208992443238 238
+ .quad 0x03FCAC043AE68965A
+ .quad 0x03FCAD8EC205FB6AD # 0.209744945343 239
+ .quad 0x03FCAD8EC205FB6AD
+ .quad 0x03FCAF6895610DBAD # 0.210648695969 240
+ .quad 0x03FCAF6895610DBAD
+ .quad 0x03FCB0F3C3FBD65C9 # 0.211402445910 241
+ .quad 0x03FCB0F3C3FBD65C9
+ .quad 0x03FCB27F3EE674219 # 0.212156764419 242
+ .quad 0x03FCB27F3EE674219
+ .quad 0x03FCB40B063E65B0F # 0.212911652354 243
+ .quad 0x03FCB40B063E65B0F
+ .quad 0x03FCB5E65A8096C88 # 0.213818270730 244
+ .quad 0x03FCB5E65A8096C88
+ .quad 0x03FCB772CA646760C # 0.214574414434 245
+ .quad 0x03FCB772CA646760C
+ .quad 0x03FCB8FF871461198 # 0.215331130323 246
+ .quad 0x03FCB8FF871461198
+ .quad 0x03FCBA8C90AE4AD19 # 0.216088419265 247
+ .quad 0x03FCBA8C90AE4AD19
+ .quad 0x03FCBC19E74FFCBDA # 0.216846282128 248
+ .quad 0x03FCBC19E74FFCBDA
+ .quad 0x03FCBDF71B83DAE7A # 0.217756476365 249
+ .quad 0x03FCBDF71B83DAE7A
+ .quad 0x03FCBF851C067555C # 0.218515604922 250
+ .quad 0x03FCBF851C067555C
+ .quad 0x03FCC11369F0CDB3C # 0.219275310193 251
+ .quad 0x03FCC11369F0CDB3C
+ .quad 0x03FCC2A205610593E # 0.220035593055 252
+ .quad 0x03FCC2A205610593E
+ .quad 0x03FCC430EE755023B # 0.220796454387 253
+ .quad 0x03FCC430EE755023B
+ .quad 0x03FCC5C0254BF23A8 # 0.221557895069 254
+ .quad 0x03FCC5C0254BF23A8
+ .quad 0x03FCC79F9AB632BF1 # 0.222472389875 255
+ .quad 0x03FCC79F9AB632BF1
+ .quad 0x03FCC92F7D09ABE20 # 0.223235108240 256
+ .quad 0x03FCC92F7D09ABE20
+ .quad 0x03FCCABFAD80D023D # 0.223998408788 257
+ .quad 0x03FCCABFAD80D023D
+ .quad 0x03FCCC502C3A2F1E8 # 0.224762292410 258
+ .quad 0x03FCCC502C3A2F1E8
+ .quad 0x03FCCDE0F9546A5E7 # 0.225526759995 259
+ .quad 0x03FCCDE0F9546A5E7
+ .quad 0x03FCCF7214EE356E9 # 0.226291812439 260
+ .quad 0x03FCCF7214EE356E9
+ .quad 0x03FCD1037F2655E7B # 0.227057450635 261
+ .quad 0x03FCD1037F2655E7B
+ .quad 0x03FCD295381BA37E9 # 0.227823675483 262
+ .quad 0x03FCD295381BA37E9
+ .quad 0x03FCD4273FED08111 # 0.228590487882 263
+ .quad 0x03FCD4273FED08111
+ .quad 0x03FCD5B996B97FB5F # 0.229357888733 264
+ .quad 0x03FCD5B996B97FB5F
+ .quad 0x03FCD74C3CA018C9C # 0.230125878940 265
+ .quad 0x03FCD74C3CA018C9C
+ .quad 0x03FCD8DF31BFF3FF2 # 0.230894459410 266
+ .quad 0x03FCD8DF31BFF3FF2
+ .quad 0x03FCDA727638446A1 # 0.231663631050 267
+ .quad 0x03FCDA727638446A1
+ .quad 0x03FCDC56CAE452F5B # 0.232587418645 268
+ .quad 0x03FCDC56CAE452F5B
+ .quad 0x03FCDDEABE5A3926E # 0.233357894066 269
+ .quad 0x03FCDDEABE5A3926E
+ .quad 0x03FCDF7F018CE771F # 0.234128963578 270
+ .quad 0x03FCDF7F018CE771F
+ .quad 0x03FCE113949BDEC62 # 0.234900628096 271
+ .quad 0x03FCE113949BDEC62
+ .quad 0x03FCE2A877A6B2C0F # 0.235672888541 272
+ .quad 0x03FCE2A877A6B2C0F
+ .quad 0x03FCE43DAACD09BEC # 0.236445745833 273
+ .quad 0x03FCE43DAACD09BEC
+ .quad 0x03FCE5D32E2E9CE87 # 0.237219200895 274
+ .quad 0x03FCE5D32E2E9CE87
+ .quad 0x03FCE76901EB38427 # 0.237993254653 275
+ .quad 0x03FCE76901EB38427
+ .quad 0x03FCE8ADE53F76866 # 0.238612929343 276
+ .quad 0x03FCE8ADE53F76866
+ .quad 0x03FCEA4449F04AAF4 # 0.239388063093 277
+ .quad 0x03FCEA4449F04AAF4
+ .quad 0x03FCEBDAFF5593E99 # 0.240163798141 278
+ .quad 0x03FCEBDAFF5593E99
+ .quad 0x03FCED72058F666C5 # 0.240940135421 279
+ .quad 0x03FCED72058F666C5
+ .quad 0x03FCEF095CBDE9937 # 0.241717075868 280
+ .quad 0x03FCEF095CBDE9937
+ .quad 0x03FCF0A1050157ED6 # 0.242494620422 281
+ .quad 0x03FCF0A1050157ED6
+ .quad 0x03FCF238FE79FF4BF # 0.243272770021 282
+ .quad 0x03FCF238FE79FF4BF
+ .quad 0x03FCF3D1494840D2F # 0.244051525609 283
+ .quad 0x03FCF3D1494840D2F
+ .quad 0x03FCF569E58C91077 # 0.244830888130 284
+ .quad 0x03FCF569E58C91077
+ .quad 0x03FCF702D36777DF0 # 0.245610858531 285
+ .quad 0x03FCF702D36777DF0
+ .quad 0x03FCF89C12F990D0C # 0.246391437760 286
+ .quad 0x03FCF89C12F990D0C
+ .quad 0x03FCFA35A4638AE2C # 0.247172626770 287
+ .quad 0x03FCFA35A4638AE2C
+ .quad 0x03FCFB7D86EEE3B92 # 0.247798017660 288
+ .quad 0x03FCFB7D86EEE3B92
+ .quad 0x03FCFD17ABFCDB683 # 0.248580306677 289
+ .quad 0x03FCFD17ABFCDB683
+ .quad 0x03FCFEB2233EA07CB # 0.249363208150 290
+ .quad 0x03FCFEB2233EA07CB
+ .quad 0x03FD0026766A9671C # 0.250146723037 291
+ .quad 0x03FD0026766A9671C
+ .quad 0x03FD00F40470C7323 # 0.250930852302 292
+ .quad 0x03FD00F40470C7323
+ .quad 0x03FD01C1BBC2735A3 # 0.251715596908 293
+ .quad 0x03FD01C1BBC2735A3
+ .quad 0x03FD028F9C7035C1D # 0.252500957822 294
+ .quad 0x03FD028F9C7035C1D
+ .quad 0x03FD03346E0106062 # 0.253129690945 295
+ .quad 0x03FD03346E0106062
+ .quad 0x03FD0402994B4F041 # 0.253916163656 296
+ .quad 0x03FD0402994B4F041
+ .quad 0x03FD04D0EE20620AF # 0.254703255393 297
+ .quad 0x03FD04D0EE20620AF
+ .quad 0x03FD059F6C910034D # 0.255490967131 298
+ .quad 0x03FD059F6C910034D
+ .quad 0x03FD066E14ADF4BFD # 0.256279299848 299
+ .quad 0x03FD066E14ADF4BFD
+ .quad 0x03FD07138604D5864 # 0.256910413785 300
+ .quad 0x03FD07138604D5864
+ .quad 0x03FD07E2794F3E8C1 # 0.257699866735 301
+ .quad 0x03FD07E2794F3E8C1
+ .quad 0x03FD08B196753A125 # 0.258489943414 302
+ .quad 0x03FD08B196753A125
+ .quad 0x03FD0980DD87BA2DD # 0.259280644807 303
+ .quad 0x03FD0980DD87BA2DD
+ .quad 0x03FD0A504E97BB40C # 0.260071971904 304
+ .quad 0x03FD0A504E97BB40C
+ .quad 0x03FD0AF660EB9E278 # 0.260705484754 305
+ .quad 0x03FD0AF660EB9E278
+ .quad 0x03FD0BC61DBBA97CB # 0.261497940616 306
+ .quad 0x03FD0BC61DBBA97CB
+ .quad 0x03FD0C9604B8FC51E # 0.262291024962 307
+ .quad 0x03FD0C9604B8FC51E
+ .quad 0x03FD0D3C7586CD5E5 # 0.262925945618 308
+ .quad 0x03FD0D3C7586CD5E5
+ .quad 0x03FD0E0CA89A72D29 # 0.263720163752 309
+ .quad 0x03FD0E0CA89A72D29
+ .quad 0x03FD0EDD060B78082 # 0.264515013170 310
+ .quad 0x03FD0EDD060B78082
+ .quad 0x03FD0FAD8DEB1E2C0 # 0.265310494876 311
+ .quad 0x03FD0FAD8DEB1E2C0
+ .quad 0x03FD10547F9D26ABC # 0.265947336165 312
+ .quad 0x03FD10547F9D26ABC
+ .quad 0x03FD1125540925114 # 0.266743958529 313
+ .quad 0x03FD1125540925114
+ .quad 0x03FD11F653144CB8B # 0.267541216005 314
+ .quad 0x03FD11F653144CB8B
+ .quad 0x03FD129DA43F5BE9E # 0.268179479949 315
+ .quad 0x03FD129DA43F5BE9E
+ .quad 0x03FD136EF02E8290C # 0.268977883185 316
+ .quad 0x03FD136EF02E8290C
+ .quad 0x03FD144066EDAE406 # 0.269776924378 317
+ .quad 0x03FD144066EDAE406
+ .quad 0x03FD14E817FF359D7 # 0.270416617347 318
+ .quad 0x03FD14E817FF359D7
+ .quad 0x03FD15B9DBFA9DEC8 # 0.271216809436 319
+ .quad 0x03FD15B9DBFA9DEC8
+ .quad 0x03FD168BCAF73B3EB # 0.272017642345 320
+ .quad 0x03FD168BCAF73B3EB
+ .quad 0x03FD1733DC5D68DE8 # 0.272658770753 321
+ .quad 0x03FD1733DC5D68DE8
+ .quad 0x03FD180618EF18ADE # 0.273460759729 322
+ .quad 0x03FD180618EF18ADE
+ .quad 0x03FD18D880B3826FE # 0.274263392407 323
+ .quad 0x03FD18D880B3826FE
+ .quad 0x03FD1980F2DD42B6F # 0.274905962710 324
+ .quad 0x03FD1980F2DD42B6F
+ .quad 0x03FD1A53A8902E70B # 0.275709756661 325
+ .quad 0x03FD1A53A8902E70B
+ .quad 0x03FD1AFC59297024D # 0.276353257326 326
+ .quad 0x03FD1AFC59297024D
+ .quad 0x03FD1BCF5D04AE1EA # 0.277158215914 327
+ .quad 0x03FD1BCF5D04AE1EA
+ .quad 0x03FD1CA28C64BAE54 # 0.277963822983 328
+ .quad 0x03FD1CA28C64BAE54
+ .quad 0x03FD1D4B9E796C245 # 0.278608776246 329
+ .quad 0x03FD1D4B9E796C245
+ .quad 0x03FD1E1F1C5C3A06C # 0.279415553216 330
+ .quad 0x03FD1E1F1C5C3A06C
+ .quad 0x03FD1EC86D5747AAD # 0.280061443760 331
+ .quad 0x03FD1EC86D5747AAD
+ .quad 0x03FD1F9C39F74C559 # 0.280869394034 332
+ .quad 0x03FD1F9C39F74C559
+ .quad 0x03FD2070326F1F789 # 0.281677997620 333
+ .quad 0x03FD2070326F1F789
+ .quad 0x03FD2119E59F8789C # 0.282325351583 334
+ .quad 0x03FD2119E59F8789C
+ .quad 0x03FD21EE2D300381C # 0.283135133796 335
+ .quad 0x03FD21EE2D300381C
+ .quad 0x03FD22981FBEF797A # 0.283783432036 336
+ .quad 0x03FD22981FBEF797A
+ .quad 0x03FD236CB6A339EED # 0.284594396317 337
+ .quad 0x03FD236CB6A339EED
+ .quad 0x03FD2416E8C01F606 # 0.285243641592 338
+ .quad 0x03FD2416E8C01F606
+ .quad 0x03FD24EBCF3387FF6 # 0.286055791397 339
+ .quad 0x03FD24EBCF3387FF6
+ .quad 0x03FD2596410DF963A # 0.286705986479 340
+ .quad 0x03FD2596410DF963A
+ .quad 0x03FD266B774C2AF55 # 0.287519325279 341
+ .quad 0x03FD266B774C2AF55
+ .quad 0x03FD27162913F873F # 0.288170472950 342
+ .quad 0x03FD27162913F873F
+ .quad 0x03FD27EBAF58D8C9C # 0.288985004232 343
+ .quad 0x03FD27EBAF58D8C9C
+ .quad 0x03FD2896A13E086A3 # 0.289637107288 344
+ .quad 0x03FD2896A13E086A3
+ .quad 0x03FD296C77C5C0E13 # 0.290452834554 345
+ .quad 0x03FD296C77C5C0E13
+ .quad 0x03FD2A17A9F88EDD2 # 0.291105895801 346
+ .quad 0x03FD2A17A9F88EDD2
+ .quad 0x03FD2AEDD0FF8CC2C # 0.291922822568 347
+ .quad 0x03FD2AEDD0FF8CC2C
+ .quad 0x03FD2B9943B06BD77 # 0.292576844829 348
+ .quad 0x03FD2B9943B06BD77
+ .quad 0x03FD2C6FBB7360D0E # 0.293394974630 349
+ .quad 0x03FD2C6FBB7360D0E
+ .quad 0x03FD2D1B6ED2FA90C # 0.294049960734 350
+ .quad 0x03FD2D1B6ED2FA90C
+ .quad 0x03FD2DC73F01B0DD4 # 0.294705376127 351
+ .quad 0x03FD2DC73F01B0DD4
+ .quad 0x03FD2E9E2BCE12286 # 0.295525249913 352
+ .quad 0x03FD2E9E2BCE12286
+ .quad 0x03FD2F4A3CF22EDC2 # 0.296181633264 353
+ .quad 0x03FD2F4A3CF22EDC2
+ .quad 0x03FD30217B1006601 # 0.297002718785 354
+ .quad 0x03FD30217B1006601
+ .quad 0x03FD30CDCD5ABA762 # 0.297660072959 355
+ .quad 0x03FD30CDCD5ABA762
+ .quad 0x03FD31A55D07A8590 # 0.298482373803 356
+ .quad 0x03FD31A55D07A8590
+ .quad 0x03FD3251F0AA5CC1A # 0.299140701674 357
+ .quad 0x03FD3251F0AA5CC1A
+ .quad 0x03FD32FEA167A6D70 # 0.299799463226 358
+ .quad 0x03FD32FEA167A6D70
+ .quad 0x03FD33D6A7509D491 # 0.300623525901 359
+ .quad 0x03FD33D6A7509D491
+ .quad 0x03FD348399ADA9D94 # 0.301283265328 360
+ .quad 0x03FD348399ADA9D94
+ .quad 0x03FD3530A9454ADC9 # 0.301943440298 361
+ .quad 0x03FD3530A9454ADC9
+ .quad 0x03FD360925EC44F5C # 0.302769272371 362
+ .quad 0x03FD360925EC44F5C
+ .quad 0x03FD36B6776BE1116 # 0.303430429420 363
+ .quad 0x03FD36B6776BE1116
+ .quad 0x03FD378F469437FB4 # 0.304257490918 364
+ .quad 0x03FD378F469437FB4
+ .quad 0x03FD383CDA2E14ECB # 0.304919632971 365
+ .quad 0x03FD383CDA2E14ECB
+ .quad 0x03FD38EA8B3924521 # 0.305582213748 366
+ .quad 0x03FD38EA8B3924521
+ .quad 0x03FD39C3D1FD60E74 # 0.306411057558 367
+ .quad 0x03FD39C3D1FD60E74
+ .quad 0x03FD3A71C56BB48C7 # 0.307074627589 368
+ .quad 0x03FD3A71C56BB48C7
+ .quad 0x03FD3B1FD66BC8D10 # 0.307738638238 369
+ .quad 0x03FD3B1FD66BC8D10
+ .quad 0x03FD3BF995502CB5C # 0.308569272059 370
+ .quad 0x03FD3BF995502CB5C
+ .quad 0x03FD3CA7E8FD01DF6 # 0.309234276240 371
+ .quad 0x03FD3CA7E8FD01DF6
+ .quad 0x03FD3D565A5C5BF11 # 0.309899722945 372
+ .quad 0x03FD3D565A5C5BF11
+ .quad 0x03FD3E3091E6049FB # 0.310732154526 373
+ .quad 0x03FD3E3091E6049FB
+ .quad 0x03FD3EDF463C1683E # 0.311398599069 374
+ .quad 0x03FD3EDF463C1683E
+ .quad 0x03FD3F8E1865A82DD # 0.312065488057 375
+ .quad 0x03FD3F8E1865A82DD
+ .quad 0x03FD403D086CEA79B # 0.312732822082 376
+ .quad 0x03FD403D086CEA79B
+ .quad 0x03FD4117DE854CA15 # 0.313567616354 377
+ .quad 0x03FD4117DE854CA15
+ .quad 0x03FD41C711E4BA15E # 0.314235953889 378
+ .quad 0x03FD41C711E4BA15E
+ .quad 0x03FD427663431B221 # 0.314904738398 379
+ .quad 0x03FD427663431B221
+ .quad 0x03FD4325D2AAB6F18 # 0.315573970480 380
+ .quad 0x03FD4325D2AAB6F18
+ .quad 0x03FD44014838E5513 # 0.316411140893 381
+ .quad 0x03FD44014838E5513
+ .quad 0x03FD44B0FB5AF4F44 # 0.317081382205 382
+ .quad 0x03FD44B0FB5AF4F44
+ .quad 0x03FD4560CCA7CB3B2 # 0.317752073041 383
+ .quad 0x03FD4560CCA7CB3B2
+ .quad 0x03FD4610BC29C5E18 # 0.318423214006 384
+ .quad 0x03FD4610BC29C5E18
+ .quad 0x03FD46ECD216CDCB5 # 0.319262774126 385
+ .quad 0x03FD46ECD216CDCB5
+ .quad 0x03FD479D05B65CB60 # 0.319934930091 386
+ .quad 0x03FD479D05B65CB60
+ .quad 0x03FD484D57ACE5A1A # 0.320607538154 387
+ .quad 0x03FD484D57ACE5A1A
+ .quad 0x03FD48FDC804DD1CB # 0.321280598924 388
+ .quad 0x03FD48FDC804DD1CB
+ .quad 0x03FD49DA7F3BCC420 # 0.322122562432 389
+ .quad 0x03FD49DA7F3BCC420
+ .quad 0x03FD4A8B341552B09 # 0.322796644021 390
+ .quad 0x03FD4A8B341552B09
+ .quad 0x03FD4B3C077267E9A # 0.323471180303 391
+ .quad 0x03FD4B3C077267E9A
+ .quad 0x03FD4BECF95D97914 # 0.324146171892 392
+ .quad 0x03FD4BECF95D97914
+ .quad 0x03FD4C9E09E172C3D # 0.324821619401 393
+ .quad 0x03FD4C9E09E172C3D
+ .quad 0x03FD4D4F3908901A0 # 0.325497523449 394
+ .quad 0x03FD4D4F3908901A0
+ .quad 0x03FD4E2CDF1F341C1 # 0.326343046455 395
+ .quad 0x03FD4E2CDF1F341C1
+ .quad 0x03FD4EDE535C79642 # 0.327019979972 396
+ .quad 0x03FD4EDE535C79642
+ .quad 0x03FD4F8FE65F90500 # 0.327697372039 397
+ .quad 0x03FD4F8FE65F90500
+ .quad 0x03FD5041983326F2D # 0.328375223276 398
+ .quad 0x03FD5041983326F2D
+ .quad 0x03FD50F368E1F0F02 # 0.329053534308 399
+ .quad 0x03FD50F368E1F0F02
+ .quad 0x03FD51A55876A77F5 # 0.329732305758 400
+ .quad 0x03FD51A55876A77F5
+ .quad 0x03FD5283EF743F98B # 0.330581418486 401
+ .quad 0x03FD5283EF743F98B
+ .quad 0x03FD533624B59CA35 # 0.331261228165 402
+ .quad 0x03FD533624B59CA35
+ .quad 0x03FD53E878FFE6EAE # 0.331941500300 403
+ .quad 0x03FD53E878FFE6EAE
+ .quad 0x03FD549AEC5DEF880 # 0.332622235521 404
+ .quad 0x03FD549AEC5DEF880
+ .quad 0x03FD554D7EDA8D3C4 # 0.333303434457 405
+ .quad 0x03FD554D7EDA8D3C4
+ .quad 0x03FD560030809C759 # 0.333985097742 406
+ .quad 0x03FD560030809C759
+ .quad 0x03FD56B3015AFF52C # 0.334667226008 407
+ .quad 0x03FD56B3015AFF52C
+ .quad 0x03FD5765F1749DA6C # 0.335349819892 408
+ .quad 0x03FD5765F1749DA6C
+ .quad 0x03FD581900D864FD7 # 0.336032880027 409
+ .quad 0x03FD581900D864FD7
+ .quad 0x03FD58CC2F91489F5 # 0.336716407053 410
+ .quad 0x03FD58CC2F91489F5
+ .quad 0x03FD59AC5618CCE38 # 0.337571473373 411
+ .quad 0x03FD59AC5618CCE38
+ .quad 0x03FD5A5FCB795780C # 0.338256053239 412
+ .quad 0x03FD5A5FCB795780C
+ .quad 0x03FD5B136052BCE39 # 0.338941102075 413
+ .quad 0x03FD5B136052BCE39
+ .quad 0x03FD5BC714B008E23 # 0.339626620526 414
+ .quad 0x03FD5BC714B008E23
+ .quad 0x03FD5C7AE89C4D254 # 0.340312609234 415
+ .quad 0x03FD5C7AE89C4D254
+ .quad 0x03FD5D2EDC22A12BA # 0.340999068845 416
+ .quad 0x03FD5D2EDC22A12BA
+ .quad 0x03FD5DE2EF4E224D6 # 0.341686000008 417
+ .quad 0x03FD5DE2EF4E224D6
+ .quad 0x03FD5E972229F3C15 # 0.342373403369 418
+ .quad 0x03FD5E972229F3C15
+ .quad 0x03FD5F4B74C13EA04 # 0.343061279578 419
+ .quad 0x03FD5F4B74C13EA04
+ .quad 0x03FD5FFFE71F31E9A # 0.343749629287 420
+ .quad 0x03FD5FFFE71F31E9A
+ .quad 0x03FD60B4794F02875 # 0.344438453147 421
+ .quad 0x03FD60B4794F02875
+ .quad 0x03FD61692B5BEB520 # 0.345127751813 422
+ .quad 0x03FD61692B5BEB520
+ .quad 0x03FD621DFD512D14F # 0.345817525940 423
+ .quad 0x03FD621DFD512D14F
+ .quad 0x03FD62D2EF3A0E933 # 0.346507776183 424
+ .quad 0x03FD62D2EF3A0E933
+ .quad 0x03FD63880121DC8AB # 0.347198503200 425
+ .quad 0x03FD63880121DC8AB
+ .quad 0x03FD643D3313E9B92 # 0.347889707652 426
+ .quad 0x03FD643D3313E9B92
+ .quad 0x03FD64F2851B8EE01 # 0.348581390197 427
+ .quad 0x03FD64F2851B8EE01
+ .quad 0x03FD65A7F7442AC90 # 0.349273551498 428
+ .quad 0x03FD65A7F7442AC90
+ .quad 0x03FD665D8999224A5 # 0.349966192218 429
+ .quad 0x03FD665D8999224A5
+ .quad 0x03FD67133C25E04A5 # 0.350659313022 430
+ .quad 0x03FD67133C25E04A5
+ .quad 0x03FD67C90EF5D5C4C # 0.351352914576 431
+ .quad 0x03FD67C90EF5D5C4C
+ .quad 0x03FD687F021479CEE # 0.352046997547 432
+ .quad 0x03FD687F021479CEE
+ .quad 0x03FD6935158D499B3 # 0.352741562603 433
+ .quad 0x03FD6935158D499B3
+ .quad 0x03FD69EB496BC87E5 # 0.353436610416 434
+ .quad 0x03FD69EB496BC87E5
+ .quad 0x03FD6AA19DBB7FF34 # 0.354132141656 435
+ .quad 0x03FD6AA19DBB7FF34
+ .quad 0x03FD6B581287FF9FD # 0.354828156996 436
+ .quad 0x03FD6B581287FF9FD
+ .quad 0x03FD6C0EA7DCDD591 # 0.355524657112 437
+ .quad 0x03FD6C0EA7DCDD591
+ .quad 0x03FD6C97AD3CFCFD9 # 0.356047350738 438
+ .quad 0x03FD6C97AD3CFCFD9
+ .quad 0x03FD6D4E7B9C727EC # 0.356744700836 439
+ .quad 0x03FD6D4E7B9C727EC
+ .quad 0x03FD6E056AA4421D6 # 0.357442537571 440
+ .quad 0x03FD6E056AA4421D6
+ .quad 0x03FD6EBC7A6019066 # 0.358140861621 441
+ .quad 0x03FD6EBC7A6019066
+ .quad 0x03FD6F73AADBAAAB7 # 0.358839673669 442
+ .quad 0x03FD6F73AADBAAAB7
+ .quad 0x03FD702AFC22B0C6D # 0.359538974397 443
+ .quad 0x03FD702AFC22B0C6D
+ .quad 0x03FD70E26E40EB5FA # 0.360238764489 444
+ .quad 0x03FD70E26E40EB5FA
+ .quad 0x03FD719A014220CF5 # 0.360939044629 445
+ .quad 0x03FD719A014220CF5
+ .quad 0x03FD7251B5321DC54 # 0.361639815506 446
+ .quad 0x03FD7251B5321DC54
+ .quad 0x03FD73098A1CB54BA # 0.362341077807 447
+ .quad 0x03FD73098A1CB54BA
+ .quad 0x03FD73937F783CEBA # 0.362867347444 448
+ .quad 0x03FD73937F783CEBA
+ .quad 0x03FD744B8E35E9EDA # 0.363569471398 449
+ .quad 0x03FD744B8E35E9EDA
+ .quad 0x03FD7503BE0ED6C66 # 0.364272088676 450
+ .quad 0x03FD7503BE0ED6C66
+ .quad 0x03FD75BC0F0EEE7DE # 0.364975199972 451
+ .quad 0x03FD75BC0F0EEE7DE
+ .quad 0x03FD76748142228C7 # 0.365678805982 452
+ .quad 0x03FD76748142228C7
+ .quad 0x03FD772D14B46AE00 # 0.366382907402 453
+ .quad 0x03FD772D14B46AE00
+ .quad 0x03FD77E5C971C5E06 # 0.367087504930 454
+ .quad 0x03FD77E5C971C5E06
+ .quad 0x03FD787066E04915F # 0.367616279067 455
+ .quad 0x03FD787066E04915F
+ .quad 0x03FD792955FDF47A3 # 0.368321746469 456
+ .quad 0x03FD792955FDF47A3
+ .quad 0x03FD79E26687CFB3D # 0.369027711906 457
+ .quad 0x03FD79E26687CFB3D
+ .quad 0x03FD7A9B9889F19E2 # 0.369734176082 458
+ .quad 0x03FD7A9B9889F19E2
+ .quad 0x03FD7B54EC1077A48 # 0.370441139703 459
+ .quad 0x03FD7B54EC1077A48
+ .quad 0x03FD7C0E612785C74 # 0.371148603475 460
+ .quad 0x03FD7C0E612785C74
+ .quad 0x03FD7C998F06FB152 # 0.371679529954 461
+ .quad 0x03FD7C998F06FB152
+ .quad 0x03FD7D533EF841E8A # 0.372387870696 462
+ .quad 0x03FD7D533EF841E8A
+ .quad 0x03FD7E0D109B95F19 # 0.373096713539 463
+ .quad 0x03FD7E0D109B95F19
+ .quad 0x03FD7EC703FD340AA # 0.373806059198 464
+ .quad 0x03FD7EC703FD340AA
+ .quad 0x03FD7F8119295FB9B # 0.374515908385 465
+ .quad 0x03FD7F8119295FB9B
+ .quad 0x03FD800CBF3ED1CC2 # 0.375048626146 466
+ .quad 0x03FD800CBF3ED1CC2
+ .quad 0x03FD80C70FAB0BDF6 # 0.375759358229 467
+ .quad 0x03FD80C70FAB0BDF6
+ .quad 0x03FD81818203AFC7F # 0.376470595813 468
+ .quad 0x03FD81818203AFC7F
+ .quad 0x03FD823C16551A3C3 # 0.377182339615 469
+ .quad 0x03FD823C16551A3C3
+ .quad 0x03FD82C81BE4DFF4A # 0.377716480107 470
+ .quad 0x03FD82C81BE4DFF4A
+ .quad 0x03FD8382EBC7794D1 # 0.378429111528 471
+ .quad 0x03FD8382EBC7794D1
+ .quad 0x03FD843DDDC4FB137 # 0.379142251156 472
+ .quad 0x03FD843DDDC4FB137
+ .quad 0x03FD84F8F1E9DB72B # 0.379855899714 473
+ .quad 0x03FD84F8F1E9DB72B
+ .quad 0x03FD85855776DCBFB # 0.380391470556 474
+ .quad 0x03FD85855776DCBFB
+ .quad 0x03FD8640A77EB3957 # 0.381106011494 475
+ .quad 0x03FD8640A77EB3957
+ .quad 0x03FD86FC19D05148E # 0.381821063366 476
+ .quad 0x03FD86FC19D05148E
+ .quad 0x03FD87B7AE7845C0F # 0.382536626902 477
+ .quad 0x03FD87B7AE7845C0F
+ .quad 0x03FD8844748678822 # 0.383073635776 478
+ .quad 0x03FD8844748678822
+ .quad 0x03FD89004563D3DFD # 0.383790096491 479
+ .quad 0x03FD89004563D3DFD
+ .quad 0x03FD89BC38BA356B4 # 0.384507070890 480
+ .quad 0x03FD89BC38BA356B4
+ .quad 0x03FD8A4945E20894E # 0.385045139237 481
+ .quad 0x03FD8A4945E20894E
+ .quad 0x03FD8B0575AAB1FC5 # 0.385763014358 482
+ .quad 0x03FD8B0575AAB1FC5
+ .quad 0x03FD8BC1C80F45A32 # 0.386481405193 483
+ .quad 0x03FD8BC1C80F45A32
+ .quad 0x03FD8C7E3D1C80B2F # 0.387200312485 484
+ .quad 0x03FD8C7E3D1C80B2F
+ .quad 0x03FD8D0BABACC89EE # 0.387739832326 485
+ .quad 0x03FD8D0BABACC89EE
+ .quad 0x03FD8DC85D7FE5013 # 0.388459645206 486
+ .quad 0x03FD8DC85D7FE5013
+ .quad 0x03FD8E85321ED5598 # 0.389179976589 487
+ .quad 0x03FD8E85321ED5598
+ .quad 0x03FD8F12E873862C7 # 0.389720565845 488
+ .quad 0x03FD8F12E873862C7
+ .quad 0x03FD8FCFFA1614AA0 # 0.390441806410 489
+ .quad 0x03FD8FCFFA1614AA0
+ .quad 0x03FD908D2EA7D9511 # 0.391163567538 490
+ .quad 0x03FD908D2EA7D9511
+ .quad 0x03FD911B2D09ED9D6 # 0.391705230456 491
+ .quad 0x03FD911B2D09ED9D6
+ .quad 0x03FD91D89EDD6B7FF # 0.392427904381 492
+ .quad 0x03FD91D89EDD6B7FF
+ .quad 0x03FD929633C3B7D3E # 0.393151100941 493
+ .quad 0x03FD929633C3B7D3E
+ .quad 0x03FD93247A7C99B52 # 0.393693841796 494
+ .quad 0x03FD93247A7C99B52
+ .quad 0x03FD93E24CE3195E8 # 0.394417954789 495
+ .quad 0x03FD93E24CE3195E8
+ .quad 0x03FD9470C1CB1962E # 0.394961383840 496
+ .quad 0x03FD9470C1CB1962E
+ .quad 0x03FD952ED1D9C0435 # 0.395686415592 497
+ .quad 0x03FD952ED1D9C0435
+ .quad 0x03FD95ED0535EA5D9 # 0.396411973396 498
+ .quad 0x03FD95ED0535EA5D9
+ .quad 0x03FD967BC2EDCCE17 # 0.396956487431 499
+ .quad 0x03FD967BC2EDCCE17
+ .quad 0x03FD973A3431356AE # 0.397682967666 500
+ .quad 0x03FD973A3431356AE
+ .quad 0x03FD97F8C8E64A1C7 # 0.398409976059 501
+ .quad 0x03FD97F8C8E64A1C7
+ .quad 0x03FD9887CFB8A3932 # 0.398955579419 502
+ .quad 0x03FD9887CFB8A3932
+ .quad 0x03FD9946A2946EF3C # 0.399683513937 503
+ .quad 0x03FD9946A2946EF3C
+ .quad 0x03FD99D5D8130607C # 0.400229812776 504
+ .quad 0x03FD99D5D8130607C
+ .quad 0x03FD9A94E93E1EC37 # 0.400958675782 505
+ .quad 0x03FD9A94E93E1EC37
+ .quad 0x03FD9B244D87735E8 # 0.401505671875 506
+ .quad 0x03FD9B244D87735E8
+ .quad 0x03FD9BE39D2A97F0B # 0.402235465741 507
+ .quad 0x03FD9BE39D2A97F0B
+ .quad 0x03FD9CA3109266E23 # 0.402965792595 508
+ .quad 0x03FD9CA3109266E23
+ .quad 0x03FD9D32BEA15ED3A # 0.403513887977 509
+ .quad 0x03FD9D32BEA15ED3A
+ .quad 0x03FD9DF270C1914A8 # 0.404245149435 510
+ .quad 0x03FD9DF270C1914A8
+ .quad 0x03FD9E824DEA3E135 # 0.404793946669 511
+ .quad 0x03FD9E824DEA3E135
+ .quad 0x03FD9F423EEBF9DA1 # 0.405526145127 512
+ .quad 0x03FD9F423EEBF9DA1
+ .quad 0x03FD9FD24B4D47012 # 0.406075646011 513
+ .quad 0x03FD9FD24B4D47012
+ .quad 0x03FDA0927B59DA6E2 # 0.406808783874 514
+ .quad 0x03FDA0927B59DA6E2
+ .quad 0x03FDA152CF7F3B46D # 0.407542459622 515
+ .quad 0x03FDA152CF7F3B46D
+ .quad 0x03FDA1E32653B420E # 0.408093069896 516
+ .quad 0x03FDA1E32653B420E
+ .quad 0x03FDA2A3B9C527DB1 # 0.408827688845 517
+ .quad 0x03FDA2A3B9C527DB1
+ .quad 0x03FDA33440224FA79 # 0.409379007429 518
+ .quad 0x03FDA33440224FA79
+ .quad 0x03FDA3F513098DD09 # 0.410114572008 519
+ .quad 0x03FDA3F513098DD09
+ .quad 0x03FDA485C90EBDB0C # 0.410666600728 520
+ .quad 0x03FDA485C90EBDB0C
+ .quad 0x03FDA546DB95A721A # 0.411403113374 521
+ .quad 0x03FDA546DB95A721A
+ .quad 0x03FDA5D7C16257437 # 0.411955854060 522
+ .quad 0x03FDA5D7C16257437
+ .quad 0x03FDA69913B2F6572 # 0.412693317221 523
+ .quad 0x03FDA69913B2F6572
+ .quad 0x03FDA72A2966BE1EA # 0.413246771713 524
+ .quad 0x03FDA72A2966BE1EA
+ .quad 0x03FDA7EBBBAB46E8B # 0.413985187844 525
+ .quad 0x03FDA7EBBBAB46E8B
+ .quad 0x03FDA87D0165DD199 # 0.414539357989 526
+ .quad 0x03FDA87D0165DD199
+ .quad 0x03FDA93ED3C8AD9E3 # 0.415278729556 527
+ .quad 0x03FDA93ED3C8AD9E3
+ .quad 0x03FDA9D049A9E884A # 0.415833617206 528
+ .quad 0x03FDA9D049A9E884A
+ .quad 0x03FDAA925C5588EFA # 0.416573946686 529
+ .quad 0x03FDAA925C5588EFA
+ .quad 0x03FDAB24027D5E8AF # 0.417129553701 530
+ .quad 0x03FDAB24027D5E8AF
+ .quad 0x03FDABE6559C8167C # 0.417870843580 531
+ .quad 0x03FDABE6559C8167C
+ .quad 0x03FDAC782C2B07944 # 0.418427171828 532
+ .quad 0x03FDAC782C2B07944
+ .quad 0x03FDAD3ABFE88A06E # 0.419169424599 533
+ .quad 0x03FDAD3ABFE88A06E
+ .quad 0x03FDADCCC6FDF6A80 # 0.419726475955 534
+ .quad 0x03FDADCCC6FDF6A80
+ .quad 0x03FDAE5EE2E961227 # 0.420283837790 535
+ .quad 0x03FDAE5EE2E961227
+ .quad 0x03FDAF21D34189D0A # 0.421027470470 536
+ .quad 0x03FDAF21D34189D0A
+ .quad 0x03FDAFB41FE2167B4 # 0.421585558104 537
+ .quad 0x03FDAFB41FE2167B4
+ .quad 0x03FDB07751416A7F3 # 0.422330159776 538
+ .quad 0x03FDB07751416A7F3
+ .quad 0x03FDB109CEB79DB8A # 0.422888975102 539
+ .quad 0x03FDB109CEB79DB8A
+ .quad 0x03FDB1CD41498DF12 # 0.423634548296 540
+ .quad 0x03FDB1CD41498DF12
+ .quad 0x03FDB25FEFB60CB2E # 0.424194093214 541
+ .quad 0x03FDB25FEFB60CB2E
+ .quad 0x03FDB323A3A63594A # 0.424940640468 542
+ .quad 0x03FDB323A3A63594A
+ .quad 0x03FDB3B68329C59E9 # 0.425500916886 543
+ .quad 0x03FDB3B68329C59E9
+ .quad 0x03FDB44977C148F1A # 0.426061507389 544
+ .quad 0x03FDB44977C148F1A
+ .quad 0x03FDB50D895F7773A # 0.426809450580 545
+ .quad 0x03FDB50D895F7773A
+ .quad 0x03FDB5A0AF3D169CD # 0.427370775322 546
+ .quad 0x03FDB5A0AF3D169CD
+ .quad 0x03FDB66502A41E541 # 0.428119698779 547
+ .quad 0x03FDB66502A41E541
+ .quad 0x03FDB6F859E8EF639 # 0.428681759684 548
+ .quad 0x03FDB6F859E8EF639
+ .quad 0x03FDB78BC664238C0 # 0.429244136679 549
+ .quad 0x03FDB78BC664238C0
+ .quad 0x03FDB85078123E586 # 0.429994464983 550
+ .quad 0x03FDB85078123E586
+ .quad 0x03FDB8E41624226C5 # 0.430557580905 551
+ .quad 0x03FDB8E41624226C5
+ .quad 0x03FDB9A90A06BCB3D # 0.431308895742 552
+ .quad 0x03FDB9A90A06BCB3D
+ .quad 0x03FDBA3CD9D0B81BD # 0.431872752537 553
+ .quad 0x03FDBA3CD9D0B81BD
+ .quad 0x03FDBAD0BEF3DB164 # 0.432436927446 554
+ .quad 0x03FDBAD0BEF3DB164
+ .quad 0x03FDBB9611B80E2FC # 0.433189656123 555
+ .quad 0x03FDBB9611B80E2FC
+ .quad 0x03FDBC2A28C33B75D # 0.433754574696 556
+ .quad 0x03FDBC2A28C33B75D
+ .quad 0x03FDBCBE553C2BDDF # 0.434319812582 557
+ .quad 0x03FDBCBE553C2BDDF
+ .quad 0x03FDBD84073D8EC2B # 0.435073960430 558
+ .quad 0x03FDBD84073D8EC2B
+ .quad 0x03FDBE1865CEC1EC9 # 0.435639944787 559
+ .quad 0x03FDBE1865CEC1EC9
+ .quad 0x03FDBEACD9E271AD1 # 0.436206249662 560
+ .quad 0x03FDBEACD9E271AD1
+ .quad 0x03FDBF72EB7D20355 # 0.436961822044 561
+ .quad 0x03FDBF72EB7D20355
+ .quad 0x03FDC00791D99132B # 0.437528876213 562
+ .quad 0x03FDC00791D99132B
+ .quad 0x03FDC09C4DCD565AB # 0.438096252115 563
+ .quad 0x03FDC09C4DCD565AB
+ .quad 0x03FDC162BF5DF23E4 # 0.438853254422 564
+ .quad 0x03FDC162BF5DF23E4
+ .quad 0x03FDC1F7ADCB3DAB0 # 0.439421382456 565
+ .quad 0x03FDC1F7ADCB3DAB0
+ .quad 0x03FDC28CB1E4D32FD # 0.439989833442 566
+ .quad 0x03FDC28CB1E4D32FD
+ .quad 0x03FDC35383C8850B0 # 0.440748271097 567
+ .quad 0x03FDC35383C8850B0
+ .quad 0x03FDC3E8BA8CACF27 # 0.441317477070 568
+ .quad 0x03FDC3E8BA8CACF27
+ .quad 0x03FDC47E071233744 # 0.441887007223 569
+ .quad 0x03FDC47E071233744
+ .quad 0x03FDC54539A6ABCD2 # 0.442646885679 570
+ .quad 0x03FDC54539A6ABCD2
+ .quad 0x03FDC5DAB908186FF # 0.443217173690 571
+ .quad 0x03FDC5DAB908186FF
+ .quad 0x03FDC6704E4016FF7 # 0.443787787115 572
+ .quad 0x03FDC6704E4016FF7
+ .quad 0x03FDC737E1E38F4FB # 0.444549111857 573
+ .quad 0x03FDC737E1E38F4FB
+ .quad 0x03FDC7CDAA290FEAD # 0.445120486027 574
+ .quad 0x03FDC7CDAA290FEAD
+ .quad 0x03FDC863885A74D16 # 0.445692186852 575
+ .quad 0x03FDC863885A74D16
+ .quad 0x03FDC8F97C7E299DB # 0.446264214707 576
+ .quad 0x03FDC8F97C7E299DB
+ .quad 0x03FDC9C18EDC7C26B # 0.447027427871 577
+ .quad 0x03FDC9C18EDC7C26B
+ .quad 0x03FDCA57B64E9DB05 # 0.447600220249 578
+ .quad 0x03FDCA57B64E9DB05
+ .quad 0x03FDCAEDF3C88A364 # 0.448173340907 579
+ .quad 0x03FDCAEDF3C88A364
+ .quad 0x03FDCB844750B9995 # 0.448746790220 580
+ .quad 0x03FDCB844750B9995
+ .quad 0x03FDCC4CD90B3ECE5 # 0.449511901199 581
+ .quad 0x03FDCC4CD90B3ECE5
+ .quad 0x03FDCCE3602341C10 # 0.450086118843 582
+ .quad 0x03FDCCE3602341C10
+ .quad 0x03FDCD79FD5F2BC77 # 0.450660666403 583
+ .quad 0x03FDCD79FD5F2BC77
+ .quad 0x03FDCE10B0C581284 # 0.451235544257 584
+ .quad 0x03FDCE10B0C581284
+ .quad 0x03FDCED9C27EC6607 # 0.452002562511 585
+ .quad 0x03FDCED9C27EC6607
+ .quad 0x03FDCF70A9B6D3810 # 0.452578212532 586
+ .quad 0x03FDCF70A9B6D3810
+ .quad 0x03FDD007A72F19BBC # 0.453154194116 587
+ .quad 0x03FDD007A72F19BBC
+ .quad 0x03FDD09EBAEE29DD8 # 0.453730507647 588
+ .quad 0x03FDD09EBAEE29DD8
+ .quad 0x03FDD1684D49F46AE # 0.454499442710 589
+ .quad 0x03FDD1684D49F46AE
+ .quad 0x03FDD1FF951D1F1B3 # 0.455076532271 590
+ .quad 0x03FDD1FF951D1F1B3
+ .quad 0x03FDD296F34D0B65C # 0.455653955057 591
+ .quad 0x03FDD296F34D0B65C
+ .quad 0x03FDD32E67E056BD5 # 0.456231711452 592
+ .quad 0x03FDD32E67E056BD5
+ .quad 0x03FDD3C5F2DDA1840 # 0.456809801843 593
+ .quad 0x03FDD3C5F2DDA1840
+ .quad 0x03FDD490246DEFA6A # 0.457581109247 594
+ .quad 0x03FDD490246DEFA6A
+ .quad 0x03FDD527E3D1B95FC # 0.458159980465 595
+ .quad 0x03FDD527E3D1B95FC
+ .quad 0x03FDD5BFB9B5AE71F # 0.458739186968 596
+ .quad 0x03FDD5BFB9B5AE71F
+ .quad 0x03FDD657A6207C0DB # 0.459318729146 597
+ .quad 0x03FDD657A6207C0DB
+ .quad 0x03FDD6EFA918D25CE # 0.459898607388 598
+ .quad 0x03FDD6EFA918D25CE
+ .quad 0x03FDD7BA7AD9E7DA1 # 0.460672301817 599
+ .quad 0x03FDD7BA7AD9E7DA1
+ .quad 0x03FDD852B28BE5A0F # 0.461252965726 600
+ .quad 0x03FDD852B28BE5A0F
+ .quad 0x03FDD8EB00E1CCE14 # 0.461833967001 601
+ .quad 0x03FDD8EB00E1CCE14
+ .quad 0x03FDD98365E25ABB9 # 0.462415306035 602
+ .quad 0x03FDD98365E25ABB9
+ .quad 0x03FDDA1BE1944F538 # 0.462996983220 603
+ .quad 0x03FDDA1BE1944F538
+ .quad 0x03FDDAE75484C9615 # 0.463773079495 604
+ .quad 0x03FDDAE75484C9615
+ .quad 0x03FDDB8005445488B # 0.464355547233 605
+ .quad 0x03FDDB8005445488B
+ .quad 0x03FDDC18CCCBDCB83 # 0.464938354438 606
+ .quad 0x03FDDC18CCCBDCB83
+ .quad 0x03FDDCB1AB222F33D # 0.465521501504 607
+ .quad 0x03FDDCB1AB222F33D
+ .quad 0x03FDDD4AA04E1C4B7 # 0.466104988830 608
+ .quad 0x03FDDD4AA04E1C4B7
+ .quad 0x03FDDDE3AC56775D2 # 0.466688816812 609
+ .quad 0x03FDDDE3AC56775D2
+ .quad 0x03FDDE7CCF4216D6E # 0.467272985848 610
+ .quad 0x03FDDE7CCF4216D6E
+ .quad 0x03FDDF492177D7BBC # 0.468052409114 611
+ .quad 0x03FDDF492177D7BBC
+ .quad 0x03FDDFE279E5BF4EE # 0.468637375496 612
+ .quad 0x03FDDFE279E5BF4EE
+ .quad 0x03FDE07BE94DCC439 # 0.469222684263 613
+ .quad 0x03FDE07BE94DCC439
+ .quad 0x03FDE1156FB6E2626 # 0.469808335817 614
+ .quad 0x03FDE1156FB6E2626
+ .quad 0x03FDE1AF0D27E88D7 # 0.470394330560 615
+ .quad 0x03FDE1AF0D27E88D7
+ .quad 0x03FDE248C1A7C8C26 # 0.470980668894 616
+ .quad 0x03FDE248C1A7C8C26
+ .quad 0x03FDE2E28D3D701CC # 0.471567351222 617
+ .quad 0x03FDE2E28D3D701CC
+ .quad 0x03FDE37C6FEFCED73 # 0.472154377948 618
+ .quad 0x03FDE37C6FEFCED73
+ .quad 0x03FDE449C232C39D8 # 0.472937616681 619
+ .quad 0x03FDE449C232C39D8
+ .quad 0x03FDE4E3DAEDDB5F6 # 0.473525448578 620
+ .quad 0x03FDE4E3DAEDDB5F6
+ .quad 0x03FDE57E0ADCE1EA5 # 0.474113626224 621
+ .quad 0x03FDE57E0ADCE1EA5
+ .quad 0x03FDE6185206D516F # 0.474702150027 622
+ .quad 0x03FDE6185206D516F
+ .quad 0x03FDE6B2B072B5E6F # 0.475291020395 623
+ .quad 0x03FDE6B2B072B5E6F
+ .quad 0x03FDE74D26278887A # 0.475880237735 624
+ .quad 0x03FDE74D26278887A
+ .quad 0x03FDE7E7B32C5453F # 0.476469802457 625
+ .quad 0x03FDE7E7B32C5453F
+ .quad 0x03FDE882578823D52 # 0.477059714970 626
+ .quad 0x03FDE882578823D52
+ .quad 0x03FDE91D134204C67 # 0.477649975686 627
+ .quad 0x03FDE91D134204C67
+ .quad 0x03FDE9B7E6610815A # 0.478240585015 628
+ .quad 0x03FDE9B7E6610815A
+ .quad 0x03FDEA52D0EC41E5E # 0.478831543369 629
+ .quad 0x03FDEA52D0EC41E5E
+ .quad 0x03FDEB218376ECFC0 # 0.479620031484 630
+ .quad 0x03FDEB218376ECFC0
+ .quad 0x03FDEBBCA4C4E9E87 # 0.480211805838 631
+ .quad 0x03FDEBBCA4C4E9E87
+ .quad 0x03FDEC57DD96CD0CB # 0.480803930597 632
+ .quad 0x03FDEC57DD96CD0CB
+ .quad 0x03FDECF32DF3B887D # 0.481396406174 633
+ .quad 0x03FDECF32DF3B887D
+ .quad 0x03FDED8E95E2D1B88 # 0.481989232987 634
+ .quad 0x03FDED8E95E2D1B88
+ .quad 0x03FDEE2A156B413E5 # 0.482582411453 635
+ .quad 0x03FDEE2A156B413E5
+ .quad 0x03FDEEC5AC9432FCB # 0.483175941987 636
+ .quad 0x03FDEEC5AC9432FCB
+ .quad 0x03FDEF615B64D61C7 # 0.483769825010 637
+ .quad 0x03FDEF615B64D61C7
+ .quad 0x03FDEFFD21E45D0D1 # 0.484364060939 638
+ .quad 0x03FDEFFD21E45D0D1
+ .quad 0x03FDF0990019FD887 # 0.484958650194 639
+ .quad 0x03FDF0990019FD887
+ .quad 0x03FDF134F60CF092D # 0.485553593197 640
+ .quad 0x03FDF134F60CF092D
+ .quad 0x03FDF1D103C4727E4 # 0.486148890367 641
+ .quad 0x03FDF1D103C4727E4
+ .quad 0x03FDF26D2947C2EC5 # 0.486744542127 642
+ .quad 0x03FDF26D2947C2EC5
+ .quad 0x03FDF309669E24CF9 # 0.487340548899 643
+ .quad 0x03FDF309669E24CF9
+ .quad 0x03FDF3A5BBCEDE6E1 # 0.487936911107 644
+ .quad 0x03FDF3A5BBCEDE6E1
+ .quad 0x03FDF44228E13963A # 0.488533629176 645
+ .quad 0x03FDF44228E13963A
+ .quad 0x03FDF4DEADDC82A35 # 0.489130703529 646
+ .quad 0x03FDF4DEADDC82A35
+ .quad 0x03FDF57B4AC80A79A # 0.489728134594 647
+ .quad 0x03FDF57B4AC80A79A
+ .quad 0x03FDF617FFAB248ED # 0.490325922795 648
+ .quad 0x03FDF617FFAB248ED
+ .quad 0x03FDF6B4CC8D27E87 # 0.490924068561 649
+ .quad 0x03FDF6B4CC8D27E87
+ .quad 0x03FDF751B1756EEC8 # 0.491522572320 650
+ .quad 0x03FDF751B1756EEC8
+ .quad 0x03FDF7EEAE6B5761C # 0.492121434499 651
+ .quad 0x03FDF7EEAE6B5761C
+ .quad 0x03FDF88BC3764273B # 0.492720655530 652
+ .quad 0x03FDF88BC3764273B
+ .quad 0x03FDF928F09D94B32 # 0.493320235842 653
+ .quad 0x03FDF928F09D94B32
+ .quad 0x03FDF9C635E8B6192 # 0.493920175866 654
+ .quad 0x03FDF9C635E8B6192
+ .quad 0x03FDFA63935F1208C # 0.494520476034 655
+ .quad 0x03FDFA63935F1208C
+ .quad 0x03FDFB0109081751A # 0.495121136779 656
+ .quad 0x03FDFB0109081751A
+ .quad 0x03FDFB9E96EB38311 # 0.495722158534 657
+ .quad 0x03FDFB9E96EB38311
+ .quad 0x03FDFC3C3D0FEA555 # 0.496323541733 658
+ .quad 0x03FDFC3C3D0FEA555
+ .quad 0x03FDFCD9FB7DA6DEF # 0.496925286812 659
+ .quad 0x03FDFCD9FB7DA6DEF
+ .quad 0x03FDFD77D23BEA634 # 0.497527394206 660
+ .quad 0x03FDFD77D23BEA634
+ .quad 0x03FDFE15C15234EE2 # 0.498129864352 661
+ .quad 0x03FDFE15C15234EE2
+ .quad 0x03FDFEB3C8C80A04E # 0.498732697687 662
+ .quad 0x03FDFEB3C8C80A04E
+ .quad 0x03FDFF51E8A4F0A74 # 0.499335894649 663
+ .quad 0x03FDFF51E8A4F0A74
+ .quad 0x03FDFFF020F07352E # 0.499939455677 664
+ .quad 0x03FDFFF020F07352E
+ .quad 0x03FE004738D910023 # 0.500543381211 665
+ .quad 0x03FE004738D910023
+ .quad 0x03FE00966D78C41CF # 0.501147671692 666
+ .quad 0x03FE00966D78C41CF
+ .quad 0x03FE00E5AE5B207AB # 0.501752327560 667
+ .quad 0x03FE00E5AE5B207AB
+ .quad 0x03FE011A8B18F0ED6 # 0.502155634684 668
+ .quad 0x03FE011A8B18F0ED6
+ .quad 0x03FE0169E072D7311 # 0.502760900515 669
+ .quad 0x03FE0169E072D7311
+ .quad 0x03FE01B942198A5A1 # 0.503366532915 670
+ .quad 0x03FE01B942198A5A1
+ .quad 0x03FE0208B010DB642 # 0.503972532327 671
+ .quad 0x03FE0208B010DB642
+ .quad 0x03FE02582A5C9D122 # 0.504578899198 672
+ .quad 0x03FE02582A5C9D122
+ .quad 0x03FE02A7B100A3EF0 # 0.505185633972 673
+ .quad 0x03FE02A7B100A3EF0
+ .quad 0x03FE02F74400C64EA # 0.505792737097 674
+ .quad 0x03FE02F74400C64EA
+ .quad 0x03FE0346E360DC4F9 # 0.506400209020 675
+ .quad 0x03FE0346E360DC4F9
+ .quad 0x03FE03968F24BFDB6 # 0.507008050190 676
+ .quad 0x03FE03968F24BFDB6
+ .quad 0x03FE03E647504CA89 # 0.507616261055 677
+ .quad 0x03FE03E647504CA89
+ .quad 0x03FE04360BE7603AE # 0.508224842066 678
+ .quad 0x03FE04360BE7603AE
+ .quad 0x03FE046B4089BE0FD # 0.508630768599 679
+ .quad 0x03FE046B4089BE0FD
+ .quad 0x03FE04BB19DCA36B3 # 0.509239967521 680
+ .quad 0x03FE04BB19DCA36B3
+ .quad 0x03FE050AFFA5671A5 # 0.509849537793 681
+ .quad 0x03FE050AFFA5671A5
+ .quad 0x03FE055AF1E7ED47B # 0.510459479867 682
+ .quad 0x03FE055AF1E7ED47B
+ .quad 0x03FE05AAF0A81BF04 # 0.511069794198 683
+ .quad 0x03FE05AAF0A81BF04
+ .quad 0x03FE05FAFBE9DAE58 # 0.511680481240 684
+ .quad 0x03FE05FAFBE9DAE58
+ .quad 0x03FE064B13B113CDD # 0.512291541448 685
+ .quad 0x03FE064B13B113CDD
+ .quad 0x03FE069B3801B2263 # 0.512902975280 686
+ .quad 0x03FE069B3801B2263
+ .quad 0x03FE06D0AC85B63A2 # 0.513310805628 687
+ .quad 0x03FE06D0AC85B63A2
+ .quad 0x03FE0720E5C40DF1D # 0.513922863181 688
+ .quad 0x03FE0720E5C40DF1D
+ .quad 0x03FE07712B9648153 # 0.514535295577 689
+ .quad 0x03FE07712B9648153
+ .quad 0x03FE07C17E0056E7C # 0.515148103277 690
+ .quad 0x03FE07C17E0056E7C
+ .quad 0x03FE0811DD062E889 # 0.515761286740 691
+ .quad 0x03FE0811DD062E889
+ .quad 0x03FE086248ABC4F3B # 0.516374846428 692
+ .quad 0x03FE086248ABC4F3B
+ .quad 0x03FE08B2C0F512033 # 0.516988782802 693
+ .quad 0x03FE08B2C0F512033
+ .quad 0x03FE08E86D82DA3EE # 0.517398283218 694
+ .quad 0x03FE08E86D82DA3EE
+ .quad 0x03FE0938FAE5D8E9B # 0.518012848432 695
+ .quad 0x03FE0938FAE5D8E9B
+ .quad 0x03FE098994F72C539 # 0.518627791569 696
+ .quad 0x03FE098994F72C539
+ .quad 0x03FE09DA3BBAD339C # 0.519243113094 697
+ .quad 0x03FE09DA3BBAD339C
+ .quad 0x03FE0A2AEF34CE3D1 # 0.519858813473 698
+ .quad 0x03FE0A2AEF34CE3D1
+ .quad 0x03FE0A7BAF691FE34 # 0.520474893172 699
+ .quad 0x03FE0A7BAF691FE34
+ .quad 0x03FE0AB18BF5823C3 # 0.520885823936 700
+ .quad 0x03FE0AB18BF5823C3
+ .quad 0x03FE0B02616952989 # 0.521502536876 701
+ .quad 0x03FE0B02616952989
+ .quad 0x03FE0B5343A234476 # 0.522119630385 702
+ .quad 0x03FE0B5343A234476
+ .quad 0x03FE0BA432A430CA2 # 0.522737104934 703
+ .quad 0x03FE0BA432A430CA2
+ .quad 0x03FE0BF52E73538CE # 0.523354960993 704
+ .quad 0x03FE0BF52E73538CE
+ .quad 0x03FE0C463713A9E6F # 0.523973199034 705
+ .quad 0x03FE0C463713A9E6F
+ .quad 0x03FE0C7C43F4C861E # 0.524385570174 706
+ .quad 0x03FE0C7C43F4C861E
+ .quad 0x03FE0CCD61FAD07D2 # 0.525004445903 707
+ .quad 0x03FE0CCD61FAD07D2
+ .quad 0x03FE0D1E8CDCE3DB6 # 0.525623704876 708
+ .quad 0x03FE0D1E8CDCE3DB6
+ .quad 0x03FE0D6FC49F16E93 # 0.526243347569 709
+ .quad 0x03FE0D6FC49F16E93
+ .quad 0x03FE0DC109458004A # 0.526863374456 710
+ .quad 0x03FE0DC109458004A
+ .quad 0x03FE0DF73E353F0ED # 0.527276939392 711
+ .quad 0x03FE0DF73E353F0ED
+ .quad 0x03FE0E4898611CCE1 # 0.527897607665 712
+ .quad 0x03FE0E4898611CCE1
+ .quad 0x03FE0E99FF7C20738 # 0.528518661406 713
+ .quad 0x03FE0E99FF7C20738
+ .quad 0x03FE0EEB738A67874 # 0.529140101094 714
+ .quad 0x03FE0EEB738A67874
+ .quad 0x03FE0F21C81D1ADC3 # 0.529554608872 715
+ .quad 0x03FE0F21C81D1ADC3
+ .quad 0x03FE0F7351C9FCD7F # 0.530176692874 716
+ .quad 0x03FE0F7351C9FCD7F
+ .quad 0x03FE0FC4E875254C1 # 0.530799164104 717
+ .quad 0x03FE0FC4E875254C1
+ .quad 0x03FE10168C22B8FB9 # 0.531422023047 718
+ .quad 0x03FE10168C22B8FB9
+ .quad 0x03FE10683CD6DEA54 # 0.532045270185 719
+ .quad 0x03FE10683CD6DEA54
+ .quad 0x03FE109EB9E2E4C97 # 0.532460984179 720
+ .quad 0x03FE109EB9E2E4C97
+ .quad 0x03FE10F08055E7785 # 0.533084879385 721
+ .quad 0x03FE10F08055E7785
+ .quad 0x03FE114253DA97DA0 # 0.533709164079 722
+ .quad 0x03FE114253DA97DA0
+ .quad 0x03FE1194347523FDC # 0.534333838748 723
+ .quad 0x03FE1194347523FDC
+ .quad 0x03FE11CAD1789B0F8 # 0.534750505421 724
+ .quad 0x03FE11CAD1789B0F8
+ .quad 0x03FE121CC7EB8F7E6 # 0.535375831132 725
+ .quad 0x03FE121CC7EB8F7E6
+ .quad 0x03FE126ECB7F8F007 # 0.536001548120 726
+ .quad 0x03FE126ECB7F8F007
+ .quad 0x03FE12A57FDA37091 # 0.536418910396 727
+ .quad 0x03FE12A57FDA37091
+ .quad 0x03FE12F799594EFBC # 0.537045280601 728
+ .quad 0x03FE12F799594EFBC
+ .quad 0x03FE1349C004AFB00 # 0.537672043392 729
+ .quad 0x03FE1349C004AFB00
+ .quad 0x03FE139BF3E094003 # 0.538299199261 730
+ .quad 0x03FE139BF3E094003
+ .quad 0x03FE13D2C873C5E13 # 0.538717521794 731
+ .quad 0x03FE13D2C873C5E13
+ .quad 0x03FE142512549C16C # 0.539345333889 732
+ .quad 0x03FE142512549C16C
+ .quad 0x03FE14776971477F1 # 0.539973540381 733
+ .quad 0x03FE14776971477F1
+ .quad 0x03FE14C9CDCE0A74D # 0.540602141763 734
+ .quad 0x03FE14C9CDCE0A74D
+ .quad 0x03FE1500C2BFD1561 # 0.541021428981 735
+ .quad 0x03FE1500C2BFD1561
+ .quad 0x03FE15533D3B8D7B3 # 0.541650689621 736
+ .quad 0x03FE15533D3B8D7B3
+ .quad 0x03FE15A5C502C6DC5 # 0.542280346478 737
+ .quad 0x03FE15A5C502C6DC5
+ .quad 0x03FE15DCD1973457B # 0.542700338085 738
+ .quad 0x03FE15DCD1973457B
+ .quad 0x03FE162F6F9071F76 # 0.543330656416 739
+ .quad 0x03FE162F6F9071F76
+ .quad 0x03FE16821AE0A13C6 # 0.543961372300 740
+ .quad 0x03FE16821AE0A13C6
+ .quad 0x03FE16B93F2C12808 # 0.544382070665 741
+ .quad 0x03FE16B93F2C12808
+ .quad 0x03FE170C00C169B51 # 0.545013450251 742
+ .quad 0x03FE170C00C169B51
+ .quad 0x03FE175ECFB935CC6 # 0.545645228728 743
+ .quad 0x03FE175ECFB935CC6
+ .quad 0x03FE17B1AC17CBD5B # 0.546277406602 744
+ .quad 0x03FE17B1AC17CBD5B
+ .quad 0x03FE17E8F12052E8A # 0.546699080654 745
+ .quad 0x03FE17E8F12052E8A
+ .quad 0x03FE183BE3DE8A7AF # 0.547331925312 746
+ .quad 0x03FE183BE3DE8A7AF
+ .quad 0x03FE188EE40F23CA7 # 0.547965170715 747
+ .quad 0x03FE188EE40F23CA7
+ .quad 0x03FE18C640FF75F06 # 0.548387557205 748
+ .quad 0x03FE18C640FF75F06
+ .quad 0x03FE191957A30FA51 # 0.549021471648 749
+ .quad 0x03FE191957A30FA51
+ .quad 0x03FE196C7BC4B1F3A # 0.549655788193 750
+ .quad 0x03FE196C7BC4B1F3A
+ .quad 0x03FE19A3F0B1860BD # 0.550078889532 751
+ .quad 0x03FE19A3F0B1860BD
+ .quad 0x03FE19F72B59A0CEC # 0.550713877383 752
+ .quad 0x03FE19F72B59A0CEC
+ .quad 0x03FE1A4A738B7A33C # 0.551349268700 753
+ .quad 0x03FE1A4A738B7A33C
+ .quad 0x03FE1A820089A2156 # 0.551773087312 754
+ .quad 0x03FE1A820089A2156
+ .quad 0x03FE1AD55F55855C8 # 0.552409152212 755
+ .quad 0x03FE1AD55F55855C8
+ .quad 0x03FE1B28CBB6EC93E # 0.553045621948 756
+ .quad 0x03FE1B28CBB6EC93E
+ .quad 0x03FE1B6070DB553D8 # 0.553470160269 757
+ .quad 0x03FE1B6070DB553D8
+ .quad 0x03FE1BB3F3EA714F6 # 0.554107305878 758
+ .quad 0x03FE1BB3F3EA714F6
+ .quad 0x03FE1BEBA8316EF2C # 0.554532295260 759
+ .quad 0x03FE1BEBA8316EF2C
+ .quad 0x03FE1C3F41FA97C6B # 0.555170118179 760
+ .quad 0x03FE1C3F41FA97C6B
+ .quad 0x03FE1C92E96C86020 # 0.555808348176 761
+ .quad 0x03FE1C92E96C86020
+ .quad 0x03FE1CCAB5FBFFEE1 # 0.556234061252 762
+ .quad 0x03FE1CCAB5FBFFEE1
+ .quad 0x03FE1D1E743BCFC47 # 0.556872970868 763
+ .quad 0x03FE1D1E743BCFC47
+ .quad 0x03FE1D72403052E75 # 0.557512288951 764
+ .quad 0x03FE1D72403052E75
+ .quad 0x03FE1DAA251D7E433 # 0.557938728190 765
+ .quad 0x03FE1DAA251D7E433
+ .quad 0x03FE1DFE07F3D1DAB # 0.558578728212 766
+ .quad 0x03FE1DFE07F3D1DAB
+ .quad 0x03FE1E35FC265D75E # 0.559005622562 767
+ .quad 0x03FE1E35FC265D75E
+ .quad 0x03FE1E89F5EB04126 # 0.559646305979 768
+ .quad 0x03FE1E89F5EB04126
+ .quad 0x03FE1EDDFD77E1FEF # 0.560287400135 769
+ .quad 0x03FE1EDDFD77E1FEF
+ .quad 0x03FE1F160A2AD0DA3 # 0.560715024687 770
+ .quad 0x03FE1F160A2AD0DA3
+ .quad 0x03FE1F6A28BA1B476 # 0.561356804579 771
+ .quad 0x03FE1F6A28BA1B476
+ .quad 0x03FE1FBE551DB43C1 # 0.561998996616 772
+ .quad 0x03FE1FBE551DB43C1
+ .quad 0x03FE1FF67A6684F47 # 0.562427353873 773
+ .quad 0x03FE1FF67A6684F47
+ .quad 0x03FE204ABDE0BE5DF # 0.563070233998 774
+ .quad 0x03FE204ABDE0BE5DF
+ .quad 0x03FE2082F29233211 # 0.563499050471 775
+ .quad 0x03FE2082F29233211
+ .quad 0x03FE20D74D2FBAFE4 # 0.564142620160 776
+ .quad 0x03FE20D74D2FBAFE4
+ .quad 0x03FE210F91524B469 # 0.564571896835 777
+ .quad 0x03FE210F91524B469
+ .quad 0x03FE2164031FDA0B0 # 0.565216157568 778
+ .quad 0x03FE2164031FDA0B0
+ .quad 0x03FE21B882DD26040 # 0.565860833641 779
+ .quad 0x03FE21B882DD26040
+ .quad 0x03FE21F0DFC65CEEC # 0.566290848698 780
+ .quad 0x03FE21F0DFC65CEEC
+ .quad 0x03FE224576C81FFE0 # 0.566936218194 781
+ .quad 0x03FE224576C81FFE0
+ .quad 0x03FE227DE33896A44 # 0.567366696031 782
+ .quad 0x03FE227DE33896A44
+ .quad 0x03FE22D2918BA4A31 # 0.568012760445 783
+ .quad 0x03FE22D2918BA4A31
+ .quad 0x03FE23274DE272A83 # 0.568659242528 784
+ .quad 0x03FE23274DE272A83
+ .quad 0x03FE235FD33D232FC # 0.569090462888 785
+ .quad 0x03FE235FD33D232FC
+ .quad 0x03FE23B4A6F9D8688 # 0.569737642287 786
+ .quad 0x03FE23B4A6F9D8688
+ .quad 0x03FE23ED3BF21CA33 # 0.570169328026 787
+ .quad 0x03FE23ED3BF21CA33
+ .quad 0x03FE24422721A89D7 # 0.570817206248 788
+ .quad 0x03FE24422721A89D7
+ .quad 0x03FE247ACBC023D2B # 0.571249358372 789
+ .quad 0x03FE247ACBC023D2B
+ .quad 0x03FE24CFCE6F80D9B # 0.571897936927 790
+ .quad 0x03FE24CFCE6F80D9B
+ .quad 0x03FE250882BCDD7D8 # 0.572330556445 791
+ .quad 0x03FE250882BCDD7D8
+ .quad 0x03FE255D9CF910A56 # 0.572979836849 792
+ .quad 0x03FE255D9CF910A56
+ .quad 0x03FE25B2C55CD5762 # 0.573629539091 793
+ .quad 0x03FE25B2C55CD5762
+ .quad 0x03FE25EB92D41992D # 0.574062908546 794
+ .quad 0x03FE25EB92D41992D
+ .quad 0x03FE2640D2D99FFEA # 0.574713315073 795
+ .quad 0x03FE2640D2D99FFEA
+ .quad 0x03FE2679B0166F51C # 0.575147154559 796
+ .quad 0x03FE2679B0166F51C
+ .quad 0x03FE26CF07CAD8B00 # 0.575798266899 797
+ .quad 0x03FE26CF07CAD8B00
+ .quad 0x03FE2707F4D5F7C40 # 0.576232577438 798
+ .quad 0x03FE2707F4D5F7C40
+ .quad 0x03FE275D644670606 # 0.576884397124 799
+ .quad 0x03FE275D644670606
+ .quad 0x03FE27966128AB11B # 0.577319179739 800
+ .quad 0x03FE27966128AB11B
+ .quad 0x03FE27EBE8626A387 # 0.577971708311 801
+ .quad 0x03FE27EBE8626A387
+ .quad 0x03FE2824F52493BD2 # 0.578406964030 802
+ .quad 0x03FE2824F52493BD2
+ .quad 0x03FE287A9434DBC7B # 0.579060203030 803
+ .quad 0x03FE287A9434DBC7B
+ .quad 0x03FE28B3B0DFCEB80 # 0.579495932884 804
+ .quad 0x03FE28B3B0DFCEB80
+ .quad 0x03FE290967D3ED18D # 0.580149883861 805
+ .quad 0x03FE290967D3ED18D
+ .quad 0x03FE294294708B773 # 0.580586088885 806
+ .quad 0x03FE294294708B773
+ .quad 0x03FE29986355D8C69 # 0.581240753393 807
+ .quad 0x03FE29986355D8C69
+ .quad 0x03FE29D19FED0C082 # 0.581677434622 808
+ .quad 0x03FE29D19FED0C082
+ .quad 0x03FE2A2786D0EC107 # 0.582332814220 809
+ .quad 0x03FE2A2786D0EC107
+ .quad 0x03FE2A60D36BA5253 # 0.582769972697 810
+ .quad 0x03FE2A60D36BA5253
+ .quad 0x03FE2AB6D25B86EF7 # 0.583426068948 811
+ .quad 0x03FE2AB6D25B86EF7
+ .quad 0x03FE2AF02F02BE4AB # 0.583863705716 812
+ .quad 0x03FE2AF02F02BE4AB
+ .quad 0x03FE2B46460C1C2B3 # 0.584520520190 813
+ .quad 0x03FE2B46460C1C2B3
+ .quad 0x03FE2B7FB2C8D1CC1 # 0.584958636297 814
+ .quad 0x03FE2B7FB2C8D1CC1
+ .quad 0x03FE2BD5E1F9316F2 # 0.585616170568 815
+ .quad 0x03FE2BD5E1F9316F2
+ .quad 0x03FE2C0F5ED46CE8D # 0.586054767066 816
+ .quad 0x03FE2C0F5ED46CE8D
+ .quad 0x03FE2C65A6395F5F5 # 0.586713022712 817
+ .quad 0x03FE2C65A6395F5F5
+ .quad 0x03FE2C9F333C2FE1E # 0.587152100656 818
+ .quad 0x03FE2C9F333C2FE1E
+ .quad 0x03FE2CF592E351AE5 # 0.587811079263 819
+ .quad 0x03FE2CF592E351AE5
+ .quad 0x03FE2D2F3016CE0EF # 0.588250639709 820
+ .quad 0x03FE2D2F3016CE0EF
+ .quad 0x03FE2D85A80DC7324 # 0.588910342867 821
+ .quad 0x03FE2D85A80DC7324
+ .quad 0x03FE2DBF557B0DF43 # 0.589350386878 822
+ .quad 0x03FE2DBF557B0DF43
+ .quad 0x03FE2E15E5CF91FA7 # 0.590010816181 823
+ .quad 0x03FE2E15E5CF91FA7
+ .quad 0x03FE2E4FA37FC9577 # 0.590451344823 824
+ .quad 0x03FE2E4FA37FC9577
+ .quad 0x03FE2E8967B3BF4E1 # 0.590892067615 825
+ .quad 0x03FE2E8967B3BF4E1
+ .quad 0x03FE2EE01A3BED567 # 0.591553516212 826
+ .quad 0x03FE2EE01A3BED567
+ .quad 0x03FE2F19EEBFB00BA # 0.591994725131 827
+ .quad 0x03FE2F19EEBFB00BA
+ .quad 0x03FE2F70B9C67A7C2 # 0.592656903723 828
+ .quad 0x03FE2F70B9C67A7C2
+ .quad 0x03FE2FAA9EA342D04 # 0.593098599843 829
+ .quad 0x03FE2FAA9EA342D04
+ .quad 0x03FE3001823684D73 # 0.593761510043 830
+ .quad 0x03FE3001823684D73
+ .quad 0x03FE303B7775937EF # 0.594203694441 831
+ .quad 0x03FE303B7775937EF
+ .quad 0x03FE309273A3340FC # 0.594867337868 832
+ .quad 0x03FE309273A3340FC
+ .quad 0x03FE30CC794DD19D0 # 0.595310011625 833
+ .quad 0x03FE30CC794DD19D0
+ .quad 0x03FE3106858C76BB7 # 0.595752881428 834
+ .quad 0x03FE3106858C76BB7
+ .quad 0x03FE315DA4434068B # 0.596417554101 835
+ .quad 0x03FE315DA4434068B
+ .quad 0x03FE3197C0FA80E6A # 0.596860914783 836
+ .quad 0x03FE3197C0FA80E6A
+ .quad 0x03FE31EEF86D36EF1 # 0.597526324589 837
+ .quad 0x03FE31EEF86D36EF1
+ .quad 0x03FE322925A66E62D # 0.597970177237 838
+ .quad 0x03FE322925A66E62D
+ .quad 0x03FE328075E32022F # 0.598636325813 839
+ .quad 0x03FE328075E32022F
+ .quad 0x03FE32BAB3A7B21E9 # 0.599080671521 840
+ .quad 0x03FE32BAB3A7B21E9
+ .quad 0x03FE32F4F80D0B1BD # 0.599525214760 841
+ .quad 0x03FE32F4F80D0B1BD
+ .quad 0x03FE334C6B15D30DD # 0.600192400374 842
+ .quad 0x03FE334C6B15D30DD
+ .quad 0x03FE3386C013B90D6 # 0.600637438209 843
+ .quad 0x03FE3386C013B90D6
+ .quad 0x03FE33DE4C086C40A # 0.601305366543 844
+ .quad 0x03FE33DE4C086C40A
+ .quad 0x03FE3418B1A85622C # 0.601750900077 845
+ .quad 0x03FE3418B1A85622C
+ .quad 0x03FE34531DF21CFE3 # 0.602196632199 846
+ .quad 0x03FE34531DF21CFE3
+ .quad 0x03FE34AACCE299BA5 # 0.602865603124 847
+ .quad 0x03FE34AACCE299BA5
+ .quad 0x03FE34E549DBB21EF # 0.603311832493 848
+ .quad 0x03FE34E549DBB21EF
+ .quad 0x03FE353D11DA4F855 # 0.603981550121 849
+ .quad 0x03FE353D11DA4F855
+ .quad 0x03FE35779F8C43D6D # 0.604428277847 850
+ .quad 0x03FE35779F8C43D6D
+ .quad 0x03FE35B233F13DD4A # 0.604875205229 851
+ .quad 0x03FE35B233F13DD4A
+ .quad 0x03FE360A1F1BBA738 # 0.605545971045 852
+ .quad 0x03FE360A1F1BBA738
+ .quad 0x03FE3644C446F97BC # 0.605993398346 853
+ .quad 0x03FE3644C446F97BC
+ .quad 0x03FE367F702A9EA94 # 0.606441025927 854
+ .quad 0x03FE367F702A9EA94
+ .quad 0x03FE36D77E9D34FD7 # 0.607112843218 855
+ .quad 0x03FE36D77E9D34FD7
+ .quad 0x03FE37123B54987B7 # 0.607560972287 856
+ .quad 0x03FE37123B54987B7
+ .quad 0x03FE376A630C0A1D6 # 0.608233542652 857
+ .quad 0x03FE376A630C0A1D6
+ .quad 0x03FE37A530A0D5A31 # 0.608682174333 858
+ .quad 0x03FE37A530A0D5A31
+ .quad 0x03FE37E004F74E13B # 0.609131007374 859
+ .quad 0x03FE37E004F74E13B
+ .quad 0x03FE383850278CFD9 # 0.609804634884 860
+ .quad 0x03FE383850278CFD9
+ .quad 0x03FE3873356902AB7 # 0.610253972119 861
+ .quad 0x03FE3873356902AB7
+ .quad 0x03FE38AE2171976E8 # 0.610703511349 862
+ .quad 0x03FE38AE2171976E8
+ .quad 0x03FE390690373AFFF # 0.611378199331 863
+ .quad 0x03FE390690373AFFF
+ .quad 0x03FE39418D3872A53 # 0.611828244343 864
+ .quad 0x03FE39418D3872A53
+ .quad 0x03FE397C91064221F # 0.612278491987 865
+ .quad 0x03FE397C91064221F
+ .quad 0x03FE39D5237E045A5 # 0.612954243787 866
+ .quad 0x03FE39D5237E045A5
+ .quad 0x03FE3A1038522CE82 # 0.613404998809 867
+ .quad 0x03FE3A1038522CE82
+ .quad 0x03FE3A68E45AD354B # 0.614081512534 868
+ .quad 0x03FE3A68E45AD354B
+ .quad 0x03FE3AA40A3F2A68B # 0.614532776080 869
+ .quad 0x03FE3AA40A3F2A68B
+ .quad 0x03FE3ADF36F98A182 # 0.614984243356 870
+ .quad 0x03FE3ADF36F98A182
+ .quad 0x03FE3B3806E5DF340 # 0.615661826668 871
+ .quad 0x03FE3B3806E5DF340
+ .quad 0x03FE3B7344BE40311 # 0.616113804077 872
+ .quad 0x03FE3B7344BE40311
+ .quad 0x03FE3BAE897234A87 # 0.616565985862 873
+ .quad 0x03FE3BAE897234A87
+ .quad 0x03FE3C077D5F51881 # 0.617244642149 874
+ .quad 0x03FE3C077D5F51881
+ .quad 0x03FE3C42D33F2AE7B # 0.617697335683 875
+ .quad 0x03FE3C42D33F2AE7B
+ .quad 0x03FE3C7E30002960C # 0.618150234241 876
+ .quad 0x03FE3C7E30002960C
+ .quad 0x03FE3CD7480B4A8A3 # 0.618829966906 877
+ .quad 0x03FE3CD7480B4A8A3
+ .quad 0x03FE3D12B60622748 # 0.619283378838 878
+ .quad 0x03FE3D12B60622748
+ .quad 0x03FE3D4E2AE7B7E2B # 0.619736996447 879
+ .quad 0x03FE3D4E2AE7B7E2B
+ .quad 0x03FE3D89A6B1A558D # 0.620190819917 880
+ .quad 0x03FE3D89A6B1A558D
+ .quad 0x03FE3DE2ED57B1F9B # 0.620871941524 881
+ .quad 0x03FE3DE2ED57B1F9B
+ .quad 0x03FE3E1E7A6D8330E # 0.621326280468 882
+ .quad 0x03FE3E1E7A6D8330E
+ .quad 0x03FE3E5A0E714DA6E # 0.621780825931 883
+ .quad 0x03FE3E5A0E714DA6E
+ .quad 0x03FE3EB37978B85B6 # 0.622463031756 884
+ .quad 0x03FE3EB37978B85B6
+ .quad 0x03FE3EEF1ED68236B # 0.622918094335 885
+ .quad 0x03FE3EEF1ED68236B
+ .quad 0x03FE3F2ACB27ED6C7 # 0.623373364090 886
+ .quad 0x03FE3F2ACB27ED6C7
+ .quad 0x03FE3F845AAE68C81 # 0.624056657591 887
+ .quad 0x03FE3F845AAE68C81
+ .quad 0x03FE3FC0186800514 # 0.624512446113 888
+ .quad 0x03FE3FC0186800514
+ .quad 0x03FE3FFBDD1AE8406 # 0.624968442473 889
+ .quad 0x03FE3FFBDD1AE8406
+ .quad 0x03FE4037A8C8C197A # 0.625424646860 890
+ .quad 0x03FE4037A8C8C197A
+ .quad 0x03FE409167679DD99 # 0.626109343909 891
+ .quad 0x03FE409167679DD99
+ .quad 0x03FE40CD448FF6DD6 # 0.626566069196 892
+ .quad 0x03FE40CD448FF6DD6
+ .quad 0x03FE410928B8F950F # 0.627023003177 893
+ .quad 0x03FE410928B8F950F
+ .quad 0x03FE41630C1B50AFF # 0.627708795866 894
+ .quad 0x03FE41630C1B50AFF
+ .quad 0x03FE419F01CD27AD0 # 0.628166252416 895
+ .quad 0x03FE419F01CD27AD0
+ .quad 0x03FE41DAFE85672B9 # 0.628623918328 896
+ .quad 0x03FE41DAFE85672B9
+ .quad 0x03FE42170245B4C6A # 0.629081793794 897
+ .quad 0x03FE42170245B4C6A
+ .quad 0x03FE42711518DF546 # 0.629769000326 898
+ .quad 0x03FE42711518DF546
+ .quad 0x03FE42AD2A74888A0 # 0.630227400518 899
+ .quad 0x03FE42AD2A74888A0
+ .quad 0x03FE42E946DE080C0 # 0.630686010936 900
+ .quad 0x03FE42E946DE080C0
+ .quad 0x03FE43437EB9D9424 # 0.631374321162 901
+ .quad 0x03FE43437EB9D9424
+ .quad 0x03FE437FACCD31C10 # 0.631833457993 902
+ .quad 0x03FE437FACCD31C10
+ .quad 0x03FE43BBE1F42FE09 # 0.632292805727 903
+ .quad 0x03FE43BBE1F42FE09
+ .quad 0x03FE43F81E307DE5E # 0.632752364559 904
+ .quad 0x03FE43F81E307DE5E
+ .quad 0x03FE445285D68EA69 # 0.633442099038 905
+ .quad 0x03FE445285D68EA69
+ .quad 0x03FE448ED3CF71355 # 0.633902186463 906
+ .quad 0x03FE448ED3CF71355
+ .quad 0x03FE44CB28E37C3EE # 0.634362485666 907
+ .quad 0x03FE44CB28E37C3EE
+ .quad 0x03FE450785145CAFE # 0.634822996841 908
+ .quad 0x03FE450785145CAFE
+ .quad 0x03FE45621CB769366 # 0.635514161481 909
+ .quad 0x03FE45621CB769366
+ .quad 0x03FE459E8AB7B799D # 0.635975203444 910
+ .quad 0x03FE459E8AB7B799D
+ .quad 0x03FE45DAFFDABD4DB # 0.636436458065 911
+ .quad 0x03FE45DAFFDABD4DB
+ .quad 0x03FE46177C2229EC0 # 0.636897925539 912
+ .quad 0x03FE46177C2229EC0
+ .quad 0x03FE467243F53F69E # 0.637590526283 913
+ .quad 0x03FE467243F53F69E
+ .quad 0x03FE46AED21F117FC # 0.638052526753 914
+ .quad 0x03FE46AED21F117FC
+ .quad 0x03FE46EB677335D13 # 0.638514740766 915
+ .quad 0x03FE46EB677335D13
+ .quad 0x03FE472803F35EAAE # 0.638977168520 916
+ .quad 0x03FE472803F35EAAE
+ .quad 0x03FE4764A7A13EF3B # 0.639439810212 917
+ .quad 0x03FE4764A7A13EF3B
+ .quad 0x03FE47BFAA9F80271 # 0.640134174319 918
+ .quad 0x03FE47BFAA9F80271
+ .quad 0x03FE47FC60471DAF8 # 0.640597351724 919
+ .quad 0x03FE47FC60471DAF8
+ .quad 0x03FE48391D226992D # 0.641060743762 920
+ .quad 0x03FE48391D226992D
+ .quad 0x03FE4875E1331971E # 0.641524350631 921
+ .quad 0x03FE4875E1331971E
+ .quad 0x03FE48D114D3FB884 # 0.642220164181 922
+ .quad 0x03FE48D114D3FB884
+ .quad 0x03FE490DEAF1A3FC8 # 0.642684309003 923
+ .quad 0x03FE490DEAF1A3FC8
+ .quad 0x03FE494AC84AB0ED3 # 0.643148669355 924
+ .quad 0x03FE494AC84AB0ED3
+ .quad 0x03FE4987ACE0DABB0 # 0.643613245438 925
+ .quad 0x03FE4987ACE0DABB0
+ .quad 0x03FE49C498B5DA63F # 0.644078037452 926
+ .quad 0x03FE49C498B5DA63F
+ .quad 0x03FE4A20080EF10B2 # 0.644775630783 927
+ .quad 0x03FE4A20080EF10B2
+ .quad 0x03FE4A5D060894B8C # 0.645240963504 928
+ .quad 0x03FE4A5D060894B8C
+ .quad 0x03FE4A9A0B471A943 # 0.645706512861 929
+ .quad 0x03FE4A9A0B471A943
+ .quad 0x03FE4AD717CC3E626 # 0.646172279055 930
+ .quad 0x03FE4AD717CC3E626
+ .quad 0x03FE4B142B99BC871 # 0.646638262288 931
+ .quad 0x03FE4B142B99BC871
+ .quad 0x03FE4B6FD6F970C1F # 0.647337644529 932
+ .quad 0x03FE4B6FD6F970C1F
+ .quad 0x03FE4BACFD036D080 # 0.647804171246 933
+ .quad 0x03FE4BACFD036D080
+ .quad 0x03FE4BEA2A5BDBE87 # 0.648270915712 934
+ .quad 0x03FE4BEA2A5BDBE87
+ .quad 0x03FE4C275F047C956 # 0.648737878130 935
+ .quad 0x03FE4C275F047C956
+ .quad 0x03FE4C649AFF0EE16 # 0.649205058703 936
+ .quad 0x03FE4C649AFF0EE16
+ .quad 0x03FE4CC082B46485A # 0.649906239052 937
+ .quad 0x03FE4CC082B46485A
+ .quad 0x03FE4CFDD1037E37C # 0.650373965908 938
+ .quad 0x03FE4CFDD1037E37C
+ .quad 0x03FE4D3B26AAADDD9 # 0.650841911635 939
+ .quad 0x03FE4D3B26AAADDD9
+ .quad 0x03FE4D7883ABB61F6 # 0.651310076438 940
+ .quad 0x03FE4D7883ABB61F6
+ .quad 0x03FE4DB5E8085A477 # 0.651778460521 941
+ .quad 0x03FE4DB5E8085A477
+ .quad 0x03FE4DF353C25E42B # 0.652247064091 942
+ .quad 0x03FE4DF353C25E42B
+ .quad 0x03FE4E4F832C560DD # 0.652950381434 943
+ .quad 0x03FE4E4F832C560DD
+ .quad 0x03FE4E8D015786F16 # 0.653419534621 944
+ .quad 0x03FE4E8D015786F16
+ .quad 0x03FE4ECA86E64A683 # 0.653888908016 945
+ .quad 0x03FE4ECA86E64A683
+ .quad 0x03FE4F0813DA673DD # 0.654358501826 946
+ .quad 0x03FE4F0813DA673DD
+ .quad 0x03FE4F45A835A4E19 # 0.654828316258 947
+ .quad 0x03FE4F45A835A4E19
+ .quad 0x03FE4F8343F9CB678 # 0.655298351519 948
+ .quad 0x03FE4F8343F9CB678
+ .quad 0x03FE4FDFBB88A119A # 0.656003818920 949
+ .quad 0x03FE4FDFBB88A119A
+ .quad 0x03FE501D69DADD660 # 0.656474407164 950
+ .quad 0x03FE501D69DADD660
+ .quad 0x03FE505B1F9C43ED7 # 0.656945216966 951
+ .quad 0x03FE505B1F9C43ED7
+ .quad 0x03FE5098DCCE9FABA # 0.657416248534 952
+ .quad 0x03FE5098DCCE9FABA
+ .quad 0x03FE50D6A173BC425 # 0.657887502077 953
+ .quad 0x03FE50D6A173BC425
+ .quad 0x03FE51146D8D65F98 # 0.658358977805 954
+ .quad 0x03FE51146D8D65F98
+ .quad 0x03FE5152411D69C03 # 0.658830675927 955
+ .quad 0x03FE5152411D69C03
+ .quad 0x03FE51AF0C774A2D0 # 0.659538640558 956
+ .quad 0x03FE51AF0C774A2D0
+ .quad 0x03FE51ECF2B713F8A # 0.660010895584 957
+ .quad 0x03FE51ECF2B713F8A
+ .quad 0x03FE522AE0738A3D8 # 0.660483373741 958
+ .quad 0x03FE522AE0738A3D8
+ .quad 0x03FE5268D5AE7CDCB # 0.660956075239 959
+ .quad 0x03FE5268D5AE7CDCB
+ .quad 0x03FE52A6D269BC600 # 0.661429000289 960
+ .quad 0x03FE52A6D269BC600
+ .quad 0x03FE52E4D6A719F9B # 0.661902149103 961
+ .quad 0x03FE52E4D6A719F9B
+ .quad 0x03FE5322E26867857 # 0.662375521893 962
+ .quad 0x03FE5322E26867857
+ .quad 0x03FE53800225BA6E2 # 0.663086001497 963
+ .quad 0x03FE53800225BA6E2
+ .quad 0x03FE53BE20B8DA502 # 0.663559935155 964
+ .quad 0x03FE53BE20B8DA502
+ .quad 0x03FE53FC46D64DDD1 # 0.664034093533 965
+ .quad 0x03FE53FC46D64DDD1
+ .quad 0x03FE543A747FE9ED6 # 0.664508476843 966
+ .quad 0x03FE543A747FE9ED6
+ .quad 0x03FE5478A9B78404C # 0.664983085300 967
+ .quad 0x03FE5478A9B78404C
+ .quad 0x03FE54B6E67EF251C # 0.665457919117 968
+ .quad 0x03FE54B6E67EF251C
+ .quad 0x03FE54F52AD80BAE9 # 0.665932978509 969
+ .quad 0x03FE54F52AD80BAE9
+ .quad 0x03FE553376C4A7A16 # 0.666408263689 970
+ .quad 0x03FE553376C4A7A16
+ .quad 0x03FE5571CA469E5C9 # 0.666883774872 971
+ .quad 0x03FE5571CA469E5C9
+ .quad 0x03FE55CF55C5A5437 # 0.667597465874 972
+ .quad 0x03FE55CF55C5A5437
+ .quad 0x03FE560DBC45153C7 # 0.668073543008 973
+ .quad 0x03FE560DBC45153C7
+ .quad 0x03FE564C2A6059FE7 # 0.668549846899 974
+ .quad 0x03FE564C2A6059FE7
+ .quad 0x03FE568AA0194EC6E # 0.669026377763 975
+ .quad 0x03FE568AA0194EC6E
+ .quad 0x03FE56C91D71CF810 # 0.669503135817 976
+ .quad 0x03FE56C91D71CF810
+ .quad 0x03FE5707A26BB8C66 # 0.669980121278 977
+ .quad 0x03FE5707A26BB8C66
+ .quad 0x03FE57462F08E7DF5 # 0.670457334363 978
+ .quad 0x03FE57462F08E7DF5
+ .quad 0x03FE5784C34B3AC30 # 0.670934775289 979
+ .quad 0x03FE5784C34B3AC30
+ .quad 0x03FE57C35F3490183 # 0.671412444273 980
+ .quad 0x03FE57C35F3490183
+ .quad 0x03FE580202C6C7353 # 0.671890341535 981
+ .quad 0x03FE580202C6C7353
+ .quad 0x03FE5840AE03C0204 # 0.672368467291 982
+ .quad 0x03FE5840AE03C0204
+ .quad 0x03FE589EBD437CA31 # 0.673086084831 983
+ .quad 0x03FE589EBD437CA31
+ .quad 0x03FE58DD7BB392B30 # 0.673564782782 984
+ .quad 0x03FE58DD7BB392B30
+ .quad 0x03FE591C41D500163 # 0.674043709994 985
+ .quad 0x03FE591C41D500163
+ .quad 0x03FE595B0FA9A7EF1 # 0.674522866688 986
+ .quad 0x03FE595B0FA9A7EF1
+ .quad 0x03FE5999E5336E121 # 0.675002253082 987
+ .quad 0x03FE5999E5336E121
+ .quad 0x03FE59D8C2743705E # 0.675481869398 988
+ .quad 0x03FE59D8C2743705E
+ .quad 0x03FE5A17A76DE803B # 0.675961715857 989
+ .quad 0x03FE5A17A76DE803B
+ .quad 0x03FE5A56942266F7B # 0.676441792678 990
+ .quad 0x03FE5A56942266F7B
+ .quad 0x03FE5A9588939A810 # 0.676922100084 991
+ .quad 0x03FE5A9588939A810
+ .quad 0x03FE5AD484C369F2D # 0.677402638296 992
+ .quad 0x03FE5AD484C369F2D
+ .quad 0x03FE5B1388B3BD53E # 0.677883407536 993
+ .quad 0x03FE5B1388B3BD53E
+ .quad 0x03FE5B5294667D5F7 # 0.678364408027 994
+ .quad 0x03FE5B5294667D5F7
+ .quad 0x03FE5B91A7DD93852 # 0.678845639990 995
+ .quad 0x03FE5B91A7DD93852
+ .quad 0x03FE5BD0C31AE9E9D # 0.679327103649 996
+ .quad 0x03FE5BD0C31AE9E9D
+ .quad 0x03FE5C2F7A8ED5E5B # 0.680049734055 997
+ .quad 0x03FE5C2F7A8ED5E5B
+ .quad 0x03FE5C6EA94431EF9 # 0.680531777930 998
+ .quad 0x03FE5C6EA94431EF9
+ .quad 0x03FE5CADDFC6874F5 # 0.681014054284 999
+ .quad 0x03FE5CADDFC6874F5
+ .quad 0x03FE5CED1E17C35C6 # 0.681496563340 1000
+ .quad 0x03FE5CED1E17C35C6
+ .quad 0x03FE5D2C6439D4252 # 0.681979305324 1001
+ .quad 0x03FE5D2C6439D4252
+ .quad 0x03FE5D6BB22EA86F6 # 0.682462280460 1002
+ .quad 0x03FE5D6BB22EA86F6
+ .quad 0x03FE5DAB07F82FB84 # 0.682945488974 1003
+ .quad 0x03FE5DAB07F82FB84
+ .quad 0x03FE5DEA65985A350 # 0.683428931091 1004
+ .quad 0x03FE5DEA65985A350
+ .quad 0x03FE5E29CB1118D32 # 0.683912607038 1005
+ .quad 0x03FE5E29CB1118D32
+ .quad 0x03FE5E6938645D390 # 0.684396517040 1006
+ .quad 0x03FE5E6938645D390
+ .quad 0x03FE5EA8AD9419C5B # 0.684880661324 1007
+ .quad 0x03FE5EA8AD9419C5B
+ .quad 0x03FE5EE82AA241920 # 0.685365040118 1008
+ .quad 0x03FE5EE82AA241920
+ .quad 0x03FE5F27AF90C8705 # 0.685849653648 1009
+ .quad 0x03FE5F27AF90C8705
+ .quad 0x03FE5F673C61A2ED2 # 0.686334502142 1010
+ .quad 0x03FE5F673C61A2ED2
+ .quad 0x03FE5FA6D116C64F7 # 0.686819585829 1011
+ .quad 0x03FE5FA6D116C64F7
+ .quad 0x03FE5FE66DB228992 # 0.687304904936 1012
+ .quad 0x03FE5FE66DB228992
+ .quad 0x03FE60261235C0874 # 0.687790459692 1013
+ .quad 0x03FE60261235C0874
+ .quad 0x03FE6065BEA385926 # 0.688276250325 1014
+ .quad 0x03FE6065BEA385926
+ .quad 0x03FE60A572FD6FEF1 # 0.688762277066 1015
+ .quad 0x03FE60A572FD6FEF1
+ .quad 0x03FE60E52F45788E4 # 0.689248540144 1016
+ .quad 0x03FE60E52F45788E4
+ .quad 0x03FE6124F37D991D4 # 0.689735039789 1017
+ .quad 0x03FE6124F37D991D4
+ .quad 0x03FE6164BFA7CC06C # 0.690221776231 1018
+ .quad 0x03FE6164BFA7CC06C
+ .quad 0x03FE61A493C60C729 # 0.690708749700 1019
+ .quad 0x03FE61A493C60C729
+ .quad 0x03FE61E46FDA56466 # 0.691195960429 1020
+ .quad 0x03FE61E46FDA56466
+ .quad 0x03FE622453E6A6263 # 0.691683408647 1021
+ .quad 0x03FE622453E6A6263
+ .quad 0x03FE62643FECF9743 # 0.692171094587 1022
+ .quad 0x03FE62643FECF9743
+ .quad 0x03FE62A433EF4E51A # 0.692659018480 1023
+ .quad 0x03FE62A433EF4E51A
diff --git a/src/gas/vrda_scaledshifted_logr.S b/src/gas/vrda_scaledshifted_logr.S
new file mode 100644
index 0000000..960460d
--- /dev/null
+++ b/src/gas/vrda_scaledshifted_logr.S
@@ -0,0 +1,2451 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrda_scaledshifted_logr.S
+#
+# An array implementation of the log libm function.
+#  Adapted to provide a scaling and shifting factor. This routine is
+#  used by the ACML RNG distribution functions.
+#
+# Prototype:
+#
+#    void vrda_scaledshifted_logr(int n, double *x, double *y, double b, double a);
+#
+#   Computes the natural log of each element of x, multiplied by b, plus a.
+#   A reduced-precision routine. Uses the Intel novel reduction technique
+#   with frcpa to compute logs.
+#   Uses only 3 polynomial terms to achieve 52-18 = 34 significant bits.
+#
+#   This specialized routine does not handle negative numbers, 0, NaNs, or infinity.
+#   This routine is not C99 compliant.
+#   This version can compute logs in 26 cycles with n <= 24.
+#
+#
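+# For reference, a minimal scalar C sketch of the per-element operation
+# (illustrative only -- not part of this file, and it uses the libm log()
+# rather than the frcpa-based reduction implemented below; the helper name
+# is hypothetical):
+#
+#   #include <math.h>
+#
+#   void scaledshifted_log_ref(int n, const double *x, double *y,
+#                              double b, double a)
+#   {
+#       for (int i = 0; i < n; i++)
+#           y[i] = b * log(x[i]) + a;  /* natural log, scaled by b, shifted by a */
+#   }
+#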
+
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # exponent storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # exponent storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+
+
+
+.equ stack_size,0x0e8
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrda_scaledshifted_logr__
+ .set vrda_scaledshifted_logr__,__vrda_scaledshifted_logr__
+ .weak vrda_scaledshifted_logr_
+ .set vrda_scaledshifted_logr_,__vrda_scaledshifted_logr__
+
+# Fortran interface parameters are passed in by Linux as:
+# rdi - int *n
+# rsi - double *x
+# rdx - double *y
+# rcx - double *b
+# r8 - double *a
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log
+#**   VRDA_SCALEDSHIFTED_LOGR(N,X,Y,B,A)
+# C equivalent*/
+#void vrda_scaledshifted_logr__(int *n, double *x, double *y, double *b, double *a)
+#{
+#       vrda_scaledshifted_logr(*n, x, y, *b, *a);
+#}
+.globl __vrda_scaledshifted_logr__
+ .type __vrda_scaledshifted_logr__,@function
+__vrda_scaledshifted_logr__:
+ mov (%rdi),%edi
+ movlpd (%rcx),%xmm0
+ movlpd (%r8),%xmm1
+
+# C interface parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+# xmm0 - double b
+# xmm1 - double a
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_scaledshifted_logr
+ .type vrda_scaledshifted_logr,@function
+vrda_scaledshifted_logr:
+ sub $stack_size,%rsp
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+# move the scale and shift factor to another register
+ movsd %xmm0,%xmm10
+ unpcklpd %xmm10,%xmm10
+ movsd %xmm1,%xmm11
+ unpcklpd %xmm11,%xmm11
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm1
+ movhpd -8(%rsi),%xmm1
+
+# compute the logs
+
+# movdqa %xmm0,p_x(%rsp) # save the input values
+
+# use the algorithm referenced in the Itanium transcendental paper.
+
+# reduction
+# compute r = x*frcpa(x) - 1
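+#  (with this r, ln(x) = ln(x*frcpa(x)) + ln(1/frcpa(x))
+#                      = ln(1+r) + N*ln(2) + table term,
+#   so only ln(1+r), with r small, needs a polynomial; see the
+#   reconstruction step below)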
+ movdqa %xmm0,%xmm8
+ movdqa %xmm1,%xmm9
+
+ call __vrd4_frcpa@PLT
+ movdqa %xmm8,%xmm4
+ movdqa %xmm9,%xmm7
+# shift out the sign bit so the biased exponent can be extracted below
+ psllq $1,%xmm8
+ psllq $1,%xmm9
+ mulpd %xmm0,%xmm4 # r
+ mulpd %xmm1,%xmm7 # r
+ movdqa %xmm8,%xmm5
+ paddq .L__mask_rup(%rip),%xmm8
+ psrlq $53,%xmm8
+ movdqa %xmm9,%xmm6
+ paddq .L__mask_rup(%rip),%xmm6
+ psrlq $53,%xmm6
+ psubq .L__mask_3ff(%rip),%xmm8
+ psubq .L__mask_3ff(%rip),%xmm6
+ pshufd $0x058,%xmm8,%xmm8
+ pshufd $0x058,%xmm6,%xmm6
+
+
+ subpd .L__real_one(%rip),%xmm4
+ subpd .L__real_one(%rip),%xmm7
+
+ cvtdq2pd %xmm8,%xmm0 #N
+ cvtdq2pd %xmm6,%xmm1 #N
+# movdqa %xmm8,%xmm0
+# movdqa %xmm6,%xmm1
+# compute index for table lookup. if 1/2 bit set, increment the index+exponent
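+#  (roughly: the rounded top 10 mantissa bits give k in [0,1023]; the final
+#   shift left by 1 doubles k because each .L__np_lnf_table entry is stored
+#   as a pair of identical quadwords, i.e. successive entries are 16 bytes apart)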
+ psrlq $42,%xmm5
+ psrlq $42,%xmm9
+ paddq .L__int_one(%rip),%xmm5
+ paddq .L__int_one(%rip),%xmm9
+ psrlq $1,%xmm5
+ psrlq $1,%xmm9
+ pand .L__mask_3ff(%rip),%xmm5
+ pand .L__mask_3ff(%rip),%xmm9
+ psllq $1,%xmm5
+ psllq $1,%xmm9
+
+ movdqa %xmm5,p_x(%rsp) # move the indexes to a memory location
+ movdqa %xmm9,p_x2(%rsp)
+
+
+ movapd .L__real_third(%rip),%xmm3
+ movdqa %xmm3,%xmm5
+ movapd %xmm4,%xmm2
+ movapd %xmm7,%xmm8
+
+# approximation
+# compute the polynomial (truncated to two terms beyond r in this
+# reduced-precision version)
+# p(r) = p2*r^2 + p3*r^3, with p2 = -1/2 and p3 = 1/3
+
+ mulpd %xmm4,%xmm2 #r^2
+ mulpd %xmm7,%xmm8 #r^2
+
+ mulpd %xmm4,%xmm3 # 1/3r
+ mulpd %xmm7,%xmm5 # 1/3r
+# lookup the f(k) term
+ lea .L__np_lnf_table(%rip),%rdx
+ mov p_x(%rsp),%rcx
+ mov p_x+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm6 # lookup
+ movhpd (%rdx,%r9,8),%xmm6 # lookup
+
+ addpd .L__real_half(%rip),%xmm3 # p2 + p3r
+ addpd .L__real_half(%rip),%xmm5 # p2 + p3r
+
+ mov p_x2(%rsp),%rcx
+ mov p_x2+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm9 # lookup
+ movhpd (%rdx,%r9,8),%xmm9 # lookup
+
+ mulpd %xmm3,%xmm2 # r2(p2 + p3r)
+ mulpd %xmm5,%xmm8 # r2(p2 + p3r)
+ addpd %xmm4,%xmm2 # +r
+ addpd %xmm7,%xmm8 # +r
+
+
+# reconstruction
+# compute ln(x) = T + r + p(r) where
+# T = N*ln(2) + ln(1/frcpa(x)), the latter read from a table of
+# ln(1/frcpa(y)) values for y = 1 + k/1024, 0 <= k <= 1023
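+#
+# the stores below then apply the scale and shift, y = b*ln(x) + a, with
+# b broadcast in xmm10 and a broadcast in xmm11 (the mulpd/addpd pairs
+# before each movlpd/movhpd store)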
+
+ mulpd .L__real_log2(%rip),%xmm0 # compute N*__real_log2
+ mulpd .L__real_log2(%rip),%xmm1 # compute N*__real_log2
+ addpd %xmm6,%xmm2 # add the new mantissas
+ addpd %xmm9,%xmm8 # add the new mantissas
+ addpd %xmm2,%xmm0
+ addpd %xmm8,%xmm1
+
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ mulpd %xmm10,%xmm0
+ addpd %xmm11,%xmm0
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ mulpd %xmm10,%xmm1
+ addpd %xmm11,%xmm1
+ movlpd %xmm1,-16(%rdi)
+ movhpd %xmm1,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+
+
+
+# we jump here when there are leftover (fewer than four) log calls to make
+# at the end
+#  the next x array element is found via save_xa and the next y array
+#  element via save_ya.  The number of values left is in
+#  save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an _m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ movsd %xmm10,%xmm0
+ movsd %xmm11,%xmm1
+ call vrda_scaledshifted_logr@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+
+.L__real_half: .quad 0x0bfe0000000000000 # -1/2
+ .quad 0x0bfe0000000000000
+.L__real_third: .quad 0x03fd5555555555555 # 1/3
+ .quad 0x03fd5555555555555
+.L__real_fourth: .quad 0x0bfd0000000000000 # -1/4
+ .quad 0x0bfd0000000000000
+.L__real_fifth: .quad 0x03fc999999999999a # 1/5
+ .quad 0x03fc999999999999a
+.L__real_sixth: .quad 0x0bfc5555555555555 # -1/6
+ .quad 0x0bfc5555555555555
+
+.L__real_log2: .quad 0x03FE62E42FEFA39EF # 0.693147182465
+ .quad 0x03FE62E42FEFA39EF
+
+.L__mask_3ff: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+
+.L__mask_rup: .quad 0x0000003fffffffffe
+ .quad 0x0000003fffffffffe
+
+.L__int_one: .quad 0x00000000000000001
+ .quad 0x00000000000000001
+
+
+
+.L__mask_10bits: .quad 0x000000000000003ff
+ .quad 0x000000000000003ff
+
+.L__mask_expext: .quad 0x000000000003ff000
+ .quad 0x000000000003ff000
+
+.L__mask_expext2: .quad 0x000000000003ff800
+ .quad 0x000000000003ff800
+
+
+
+
+.L__np_lnf_table:
+#log table Program - logtab.c
+#Built Jan 18 2006 09:51:57
+#Compiler version 1400
+
+ .quad 0x00000000000000000 # 0.000000000000 0
+ .quad 0x00000000000000000
+ .quad 0x03F50020055655885 # 0.000977039648 1
+ .quad 0x03F50020055655885
+ .quad 0x03F60040155D5881E # 0.001955034836 2
+ .quad 0x03F60040155D5881E
+ .quad 0x03F6809048289860A # 0.002933987435 3
+ .quad 0x03F6809048289860A
+ .quad 0x03F70080559588B25 # 0.003913899321 4
+ .quad 0x03F70080559588B25
+ .quad 0x03F740C8A7478788D # 0.004894772377 5
+ .quad 0x03F740C8A7478788D
+ .quad 0x03F78121214586B02 # 0.005876608489 6
+ .quad 0x03F78121214586B02
+ .quad 0x03F7C189CBB0E283F # 0.006859409551 7
+ .quad 0x03F7C189CBB0E283F
+ .quad 0x03F8010157588DE69 # 0.007843177461 8
+ .quad 0x03F8010157588DE69
+ .quad 0x03F82145E939EF1BC # 0.008827914124 9
+ .quad 0x03F82145E939EF1BC
+ .quad 0x03F83D8896A83D7A8 # 0.009690354884 10
+ .quad 0x03F83D8896A83D7A8
+ .quad 0x03F85DDC705054DFF # 0.010676913110 11
+ .quad 0x03F85DDC705054DFF
+ .quad 0x03F87E38762CA0C6D # 0.011664445593 12
+ .quad 0x03F87E38762CA0C6D
+ .quad 0x03F89E9CAC6007563 # 0.012652954261 13
+ .quad 0x03F89E9CAC6007563
+ .quad 0x03F8BF091710935A4 # 0.013642441046 14
+ .quad 0x03F8BF091710935A4
+ .quad 0x03F8DF7DBA6777895 # 0.014632907884 15
+ .quad 0x03F8DF7DBA6777895
+ .quad 0x03F8FBEA8B13C03F9 # 0.015500371846 16
+ .quad 0x03F8FBEA8B13C03F9
+ .quad 0x03F90E3751F24F45C # 0.016492681528 17
+ .quad 0x03F90E3751F24F45C
+ .quad 0x03F91E7D80B1FBF4C # 0.017485976867 18
+ .quad 0x03F91E7D80B1FBF4C
+ .quad 0x03F92CBE4F6CC56C3 # 0.018355920375 19
+ .quad 0x03F92CBE4F6CC56C3
+ .quad 0x03F93D0C443D7258C # 0.019351069108 20
+ .quad 0x03F93D0C443D7258C
+ .quad 0x03F94D5E6176ACC89 # 0.020347209148 21
+ .quad 0x03F94D5E6176ACC89
+ .quad 0x03F95DB4A937DEF10 # 0.021344342472 22
+ .quad 0x03F95DB4A937DEF10
+ .quad 0x03F96C039490E37F4 # 0.022217650494 23
+ .quad 0x03F96C039490E37F4
+ .quad 0x03F97C61B1CF5DED7 # 0.023216651576 24
+ .quad 0x03F97C61B1CF5DED7
+ .quad 0x03F98AB77B3FD6EAD # 0.024091596947 25
+ .quad 0x03F98AB77B3FD6EAD
+ .quad 0x03F99B1D75828E780 # 0.025092472797 26
+ .quad 0x03F99B1D75828E780
+ .quad 0x03F9AB87A478CB7CB # 0.026094351403 27
+ .quad 0x03F9AB87A478CB7CB
+ .quad 0x03F9B9E8027E1916F # 0.026971819338 28
+ .quad 0x03F9B9E8027E1916F
+ .quad 0x03F9CA5A1A18613E6 # 0.027975583538 29
+ .quad 0x03F9CA5A1A18613E6
+ .quad 0x03F9D8C1670325921 # 0.028854704473 30
+ .quad 0x03F9D8C1670325921
+ .quad 0x03F9E93B6EE41F674 # 0.029860361378 31
+ .quad 0x03F9E93B6EE41F674
+ .quad 0x03F9F7A9B16782855 # 0.030741141554 32
+ .quad 0x03F9F7A9B16782855
+ .quad 0x03FA0415D89E74440 # 0.031748698315 33
+ .quad 0x03FA0415D89E74440
+ .quad 0x03FA0C58FA19DFAAB # 0.032757271269 34
+ .quad 0x03FA0C58FA19DFAAB
+ .quad 0x03FA139577CC41C1A # 0.033640607815 35
+ .quad 0x03FA139577CC41C1A
+ .quad 0x03FA1AD398C6CD57C # 0.034524725334 36
+ .quad 0x03FA1AD398C6CD57C
+ .quad 0x03FA231C9C40E204E # 0.035536103423 37
+ .quad 0x03FA231C9C40E204E
+ .quad 0x03FA2A5E4231CF7BD # 0.036421899115 38
+ .quad 0x03FA2A5E4231CF7BD
+ .quad 0x03FA32AB4D4C59CB0 # 0.037435198758 39
+ .quad 0x03FA32AB4D4C59CB0
+ .quad 0x03FA39F07BA0EBD5A # 0.038322679007 40
+ .quad 0x03FA39F07BA0EBD5A
+ .quad 0x03FA424192495D571 # 0.039337907520 41
+ .quad 0x03FA424192495D571
+ .quad 0x03FA498A4C73DA65D # 0.040227078744 42
+ .quad 0x03FA498A4C73DA65D
+ .quad 0x03FA50D4AF75CA86F # 0.041117041297 43
+ .quad 0x03FA50D4AF75CA86F
+ .quad 0x03FA592BBC15215BC # 0.042135112141 44
+ .quad 0x03FA592BBC15215BC
+ .quad 0x03FA6079B00423FF6 # 0.043026775152 45
+ .quad 0x03FA6079B00423FF6
+ .quad 0x03FA67C94F2D4BB65 # 0.043919233935 46
+ .quad 0x03FA67C94F2D4BB65
+ .quad 0x03FA70265A550E77B # 0.044940163069 47
+ .quad 0x03FA70265A550E77B
+ .quad 0x03FA77798F8D6DFDC # 0.045834331871 48
+ .quad 0x03FA77798F8D6DFDC
+ .quad 0x03FA7ECE7267CD123 # 0.046729300926 49
+ .quad 0x03FA7ECE7267CD123
+ .quad 0x03FA873184BC09586 # 0.047753104446 50
+ .quad 0x03FA873184BC09586
+ .quad 0x03FA8E8A02D2E3175 # 0.048649793163 51
+ .quad 0x03FA8E8A02D2E3175
+ .quad 0x03FA95E430F8CE456 # 0.049547286652 52
+ .quad 0x03FA95E430F8CE456
+ .quad 0x03FA9D400FF482586 # 0.050445586359 53
+ .quad 0x03FA9D400FF482586
+ .quad 0x03FAA5AB21CB34A9E # 0.051473203662 54
+ .quad 0x03FAA5AB21CB34A9E
+ .quad 0x03FAAD0AA2E784EF4 # 0.052373235867 55
+ .quad 0x03FAAD0AA2E784EF4
+ .quad 0x03FAB46BD74DA76A0 # 0.053274078860 56
+ .quad 0x03FAB46BD74DA76A0
+ .quad 0x03FABBCEBFC68F424 # 0.054175734102 57
+ .quad 0x03FABBCEBFC68F424
+ .quad 0x03FAC3335D1BBAE4D # 0.055078203060 58
+ .quad 0x03FAC3335D1BBAE4D
+ .quad 0x03FACBA87200EB8F1 # 0.056110594428 59
+ .quad 0x03FACBA87200EB8F1
+ .quad 0x03FAD310BA20455A2 # 0.057014812019 60
+ .quad 0x03FAD310BA20455A2
+ .quad 0x03FADA7AB998B77ED # 0.057919847959 61
+ .quad 0x03FADA7AB998B77ED
+ .quad 0x03FAE1E6713606CFB # 0.058825703731 62
+ .quad 0x03FAE1E6713606CFB
+ .quad 0x03FAE953E1C48603A # 0.059732380822 63
+ .quad 0x03FAE953E1C48603A
+ .quad 0x03FAF0C30C1116351 # 0.060639880722 64
+ .quad 0x03FAF0C30C1116351
+ .quad 0x03FAF833F0E927711 # 0.061548204926 65
+ .quad 0x03FAF833F0E927711
+ .quad 0x03FAFFA6911AB9309 # 0.062457354934 66
+ .quad 0x03FAFFA6911AB9309
+ .quad 0x03FB038D76BA2D737 # 0.063367332247 67
+ .quad 0x03FB038D76BA2D737
+ .quad 0x03FB0748836296412 # 0.064278138373 68
+ .quad 0x03FB0748836296412
+ .quad 0x03FB0B046EEE6F7A4 # 0.065189774824 69
+ .quad 0x03FB0B046EEE6F7A4
+ .quad 0x03FB0EC139C5DA5FD # 0.066102243114 70
+ .quad 0x03FB0EC139C5DA5FD
+ .quad 0x03FB127EE451413A8 # 0.067015544762 71
+ .quad 0x03FB127EE451413A8
+ .quad 0x03FB163D6EF9579FC # 0.067929681294 72
+ .quad 0x03FB163D6EF9579FC
+ .quad 0x03FB19FCDA271ABC0 # 0.068844654235 73
+ .quad 0x03FB19FCDA271ABC0
+ .quad 0x03FB1DBD2643D1912 # 0.069760465119 74
+ .quad 0x03FB1DBD2643D1912
+ .quad 0x03FB217E53B90D3CE # 0.070677115481 75
+ .quad 0x03FB217E53B90D3CE
+ .quad 0x03FB254062F0A9417 # 0.071594606862 76
+ .quad 0x03FB254062F0A9417
+ .quad 0x03FB29035454CBCB0 # 0.072512940806 77
+ .quad 0x03FB29035454CBCB0
+ .quad 0x03FB2CC7284FE5F1A # 0.073432118863 78
+ .quad 0x03FB2CC7284FE5F1A
+ .quad 0x03FB308BDF4CB4062 # 0.074352142586 79
+ .quad 0x03FB308BDF4CB4062
+ .quad 0x03FB345179B63DD3F # 0.075273013532 80
+ .quad 0x03FB345179B63DD3F
+ .quad 0x03FB3817F7F7D6EAB # 0.076194733263 81
+ .quad 0x03FB3817F7F7D6EAB
+ .quad 0x03FB3BDF5A7D1EE5E # 0.077117303344 82
+ .quad 0x03FB3BDF5A7D1EE5E
+ .quad 0x03FB3F1D405CE86D3 # 0.077908755701 83
+ .quad 0x03FB3F1D405CE86D3
+ .quad 0x03FB42E64BEC266E4 # 0.078832909176 84
+ .quad 0x03FB42E64BEC266E4
+ .quad 0x03FB46B03CF437BC4 # 0.079757917501 85
+ .quad 0x03FB46B03CF437BC4
+ .quad 0x03FB4A7B13E1E3E65 # 0.080683782259 86
+ .quad 0x03FB4A7B13E1E3E65
+ .quad 0x03FB4E46D1223FE84 # 0.081610505036 87
+ .quad 0x03FB4E46D1223FE84
+ .quad 0x03FB52137522AE732 # 0.082538087426 88
+ .quad 0x03FB52137522AE732
+ .quad 0x03FB5555DE434F2A0 # 0.083333843436 89
+ .quad 0x03FB5555DE434F2A0
+ .quad 0x03FB59242FF043D34 # 0.084263026485 90
+ .quad 0x03FB59242FF043D34
+ .quad 0x03FB5CF36997817B2 # 0.085193073719 91
+ .quad 0x03FB5CF36997817B2
+ .quad 0x03FB60C38BA799459 # 0.086123986746 92
+ .quad 0x03FB60C38BA799459
+ .quad 0x03FB6408F471C82A2 # 0.086922602521 93
+ .quad 0x03FB6408F471C82A2
+ .quad 0x03FB67DAC7466CB96 # 0.087855127734 94
+ .quad 0x03FB67DAC7466CB96
+ .quad 0x03FB6BAD83C1883BA # 0.088788523361 95
+ .quad 0x03FB6BAD83C1883BA
+ .quad 0x03FB6EF528C056A2D # 0.089589270768 96
+ .quad 0x03FB6EF528C056A2D
+ .quad 0x03FB72C9985035BB1 # 0.090524287199 97
+ .quad 0x03FB72C9985035BB1
+ .quad 0x03FB769EF2C6B5688 # 0.091460178704 98
+ .quad 0x03FB769EF2C6B5688
+ .quad 0x03FB79E8D70A364C6 # 0.092263069152 99
+ .quad 0x03FB79E8D70A364C6
+ .quad 0x03FB7DBFE6EA733FE # 0.093200590148 100
+ .quad 0x03FB7DBFE6EA733FE
+ .quad 0x03FB8197E2F40E3F0 # 0.094138990914 101
+ .quad 0x03FB8197E2F40E3F0
+ .quad 0x03FB84E40992A4804 # 0.094944035906 102
+ .quad 0x03FB84E40992A4804
+ .quad 0x03FB88BDBD5FC66D2 # 0.095884074919 103
+ .quad 0x03FB88BDBD5FC66D2
+ .quad 0x03FB8C985E9B9EC7E # 0.096824998438 104
+ .quad 0x03FB8C985E9B9EC7E
+ .quad 0x03FB8FE6CAB20E979 # 0.097632209567 105
+ .quad 0x03FB8FE6CAB20E979
+ .quad 0x03FB93C3261014C65 # 0.098574780162 106
+ .quad 0x03FB93C3261014C65
+ .quad 0x03FB97130DC9235DE # 0.099383405543 107
+ .quad 0x03FB97130DC9235DE
+ .quad 0x03FB9AF124D64C623 # 0.100327628989 108
+ .quad 0x03FB9AF124D64C623
+ .quad 0x03FB9E4289871E964 # 0.101137673586 109
+ .quad 0x03FB9E4289871E964
+ .quad 0x03FBA2225DD276FCB # 0.102083555691 110
+ .quad 0x03FBA2225DD276FCB
+ .quad 0x03FBA57540D1FE441 # 0.102895024494 111
+ .quad 0x03FBA57540D1FE441
+ .quad 0x03FBA956D3ECADE60 # 0.103842571097 112
+ .quad 0x03FBA956D3ECADE60
+ .quad 0x03FBACAB3693AB9C0 # 0.104655469123 113
+ .quad 0x03FBACAB3693AB9C0
+ .quad 0x03FBB08E8A10F96F4 # 0.105604686090 114
+ .quad 0x03FBB08E8A10F96F4
+ .quad 0x03FBB3E46DBA02181 # 0.106419018383 115
+ .quad 0x03FBB3E46DBA02181
+ .quad 0x03FBB7C9832F58018 # 0.107369911615 116
+ .quad 0x03FBB7C9832F58018
+ .quad 0x03FBBB20E936D6976 # 0.108185683244 117
+ .quad 0x03FBBB20E936D6976
+ .quad 0x03FBBF07C23BC54EA # 0.109138258671 118
+ .quad 0x03FBBF07C23BC54EA
+ .quad 0x03FBC260ABFFFE972 # 0.109955474734 119
+ .quad 0x03FBC260ABFFFE972
+ .quad 0x03FBC6494A2E418A0 # 0.110909738320 120
+ .quad 0x03FBC6494A2E418A0
+ .quad 0x03FBC9A3B90F57748 # 0.111728403941 121
+ .quad 0x03FBC9A3B90F57748
+ .quad 0x03FBCCFEDBFEE13A8 # 0.112547740324 122
+ .quad 0x03FBCCFEDBFEE13A8
+ .quad 0x03FBD0EA1362CDBFC # 0.113504482008 123
+ .quad 0x03FBD0EA1362CDBFC
+ .quad 0x03FBD446BD753D433 # 0.114325275488 124
+ .quad 0x03FBD446BD753D433
+ .quad 0x03FBD7A41C8627307 # 0.115146743223 125
+ .quad 0x03FBD7A41C8627307
+ .quad 0x03FBDB91F09680DF9 # 0.116105975911 126
+ .quad 0x03FBDB91F09680DF9
+ .quad 0x03FBDEF0D8D466DBB # 0.116928908339 127
+ .quad 0x03FBDEF0D8D466DBB
+ .quad 0x03FBE2507702AF03B # 0.117752518544 128
+ .quad 0x03FBE2507702AF03B
+ .quad 0x03FBE640EB3D2B411 # 0.118714255240 129
+ .quad 0x03FBE640EB3D2B411
+ .quad 0x03FBE9A214A69DD58 # 0.119539337795 130
+ .quad 0x03FBE9A214A69DD58
+ .quad 0x03FBED03F4F440969 # 0.120365101673 131
+ .quad 0x03FBED03F4F440969
+ .quad 0x03FBF0F70CDD992E4 # 0.121329355484 132
+ .quad 0x03FBF0F70CDD992E4
+ .quad 0x03FBF45A7A78B7C3B # 0.122156599431 133
+ .quad 0x03FBF45A7A78B7C3B
+ .quad 0x03FBF7BE9FEDBFDED # 0.122984528276 134
+ .quad 0x03FBF7BE9FEDBFDED
+ .quad 0x03FBFB237D8AB13FB # 0.123813143156 135
+ .quad 0x03FBFB237D8AB13FB
+ .quad 0x03FBFF1A13EAC95FD # 0.124780729104 136
+ .quad 0x03FBFF1A13EAC95FD
+ .quad 0x03FC014040CAB0229 # 0.125610834299 137
+ .quad 0x03FC014040CAB0229
+ .quad 0x03FC02F3D4301417B # 0.126441629140 138
+ .quad 0x03FC02F3D4301417B
+ .quad 0x03FC04A7C44CF87A4 # 0.127273114776 139
+ .quad 0x03FC04A7C44CF87A4
+ .quad 0x03FC06A4D1D26C5E9 # 0.128244055971 140
+ .quad 0x03FC06A4D1D26C5E9
+ .quad 0x03FC08598B59E3A07 # 0.129077042275 141
+ .quad 0x03FC08598B59E3A07
+ .quad 0x03FC0A0EA2164AF02 # 0.129910723024 142
+ .quad 0x03FC0A0EA2164AF02
+ .quad 0x03FC0BC4162F73B66 # 0.130745099376 143
+ .quad 0x03FC0BC4162F73B66
+ .quad 0x03FC0D79E7CD48E58 # 0.131580172493 144
+ .quad 0x03FC0D79E7CD48E58
+ .quad 0x03FC0F301717CF0FB # 0.132415943541 145
+ .quad 0x03FC0F301717CF0FB
+ .quad 0x03FC10E6A437247B7 # 0.133252413686 146
+ .quad 0x03FC10E6A437247B7
+ .quad 0x03FC12E6BFA8FEAD6 # 0.134229180665 147
+ .quad 0x03FC12E6BFA8FEAD6
+ .quad 0x03FC149E189F8642E # 0.135067169541 148
+ .quad 0x03FC149E189F8642E
+ .quad 0x03FC1655CFEA923A4 # 0.135905861231 149
+ .quad 0x03FC1655CFEA923A4
+ .quad 0x03FC180DE5B2ACE5C # 0.136745256915 150
+ .quad 0x03FC180DE5B2ACE5C
+ .quad 0x03FC19C65A207AC07 # 0.137585357777 151
+ .quad 0x03FC19C65A207AC07
+ .quad 0x03FC1B7F2D5CBA842 # 0.138426165001 152
+ .quad 0x03FC1B7F2D5CBA842
+ .quad 0x03FC1D385F90453F2 # 0.139267679777 153
+ .quad 0x03FC1D385F90453F2
+ .quad 0x03FC1EF1F0E40E6CD # 0.140109903297 154
+ .quad 0x03FC1EF1F0E40E6CD
+ .quad 0x03FC20ABE18124098 # 0.140952836755 155
+ .quad 0x03FC20ABE18124098
+ .quad 0x03FC22663190AEACC # 0.141796481350 156
+ .quad 0x03FC22663190AEACC
+ .quad 0x03FC2420E13BF19E3 # 0.142640838281 157
+ .quad 0x03FC2420E13BF19E3
+ .quad 0x03FC25DBF0AC4AED2 # 0.143485908754 158
+ .quad 0x03FC25DBF0AC4AED2
+ .quad 0x03FC2797600B3387B # 0.144331693975 159
+ .quad 0x03FC2797600B3387B
+ .quad 0x03FC29532F823F525 # 0.145178195155 160
+ .quad 0x03FC29532F823F525
+ .quad 0x03FC2B0F5F3B1D3EF # 0.146025413505 161
+ .quad 0x03FC2B0F5F3B1D3EF
+ .quad 0x03FC2CCBEF5F97653 # 0.146873350243 162
+ .quad 0x03FC2CCBEF5F97653
+ .quad 0x03FC2E88E01993187 # 0.147722006588 163
+ .quad 0x03FC2E88E01993187
+ .quad 0x03FC3046319311009 # 0.148571383763 164
+ .quad 0x03FC3046319311009
+ .quad 0x03FC3203E3F62D328 # 0.149421482992 165
+ .quad 0x03FC3203E3F62D328
+ .quad 0x03FC33C1F76D1F469 # 0.150272305505 166
+ .quad 0x03FC33C1F76D1F469
+ .quad 0x03FC35806C223A70F # 0.151123852534 167
+ .quad 0x03FC35806C223A70F
+ .quad 0x03FC373F423FED9A1 # 0.151976125313 168
+ .quad 0x03FC373F423FED9A1
+ .quad 0x03FC38FE79F0C3771 # 0.152829125080 169
+ .quad 0x03FC38FE79F0C3771
+ .quad 0x03FC3ABE135F62A12 # 0.153682853077 170
+ .quad 0x03FC3ABE135F62A12
+ .quad 0x03FC3C335E0447D71 # 0.154394850259 171
+ .quad 0x03FC3C335E0447D71
+ .quad 0x03FC3DF3AB13505F9 # 0.155249916579 172
+ .quad 0x03FC3DF3AB13505F9
+ .quad 0x03FC3FB45A59928CA # 0.156105714663 173
+ .quad 0x03FC3FB45A59928CA
+ .quad 0x03FC41756C0220C81 # 0.156962245765 174
+ .quad 0x03FC41756C0220C81
+ .quad 0x03FC4336E03829D61 # 0.157819511141 175
+ .quad 0x03FC4336E03829D61
+ .quad 0x03FC44F8B726F8EFE # 0.158677512051 176
+ .quad 0x03FC44F8B726F8EFE
+ .quad 0x03FC46BAF0F9F5DB8 # 0.159536249760 177
+ .quad 0x03FC46BAF0F9F5DB8
+ .quad 0x03FC48326CD3EC797 # 0.160252428262 178
+ .quad 0x03FC48326CD3EC797
+ .quad 0x03FC49F55C6502F81 # 0.161112520058 179
+ .quad 0x03FC49F55C6502F81
+ .quad 0x03FC4BB8AF55DE908 # 0.161973352249 180
+ .quad 0x03FC4BB8AF55DE908
+ .quad 0x03FC4D7C65D25566D # 0.162834926111 181
+ .quad 0x03FC4D7C65D25566D
+ .quad 0x03FC4F4080065AA7F # 0.163697242922 182
+ .quad 0x03FC4F4080065AA7F
+ .quad 0x03FC50B98CD30A759 # 0.164416408720 183
+ .quad 0x03FC50B98CD30A759
+ .quad 0x03FC527E5E4A1B58D # 0.165280090939 184
+ .quad 0x03FC527E5E4A1B58D
+ .quad 0x03FC544393F5DF80F # 0.166144519750 185
+ .quad 0x03FC544393F5DF80F
+ .quad 0x03FC56092E02BA514 # 0.167009696444 186
+ .quad 0x03FC56092E02BA514
+ .quad 0x03FC57837B3098F2C # 0.167731249257 187
+ .quad 0x03FC57837B3098F2C
+ .quad 0x03FC5949CDB873419 # 0.168597800437 188
+ .quad 0x03FC5949CDB873419
+ .quad 0x03FC5B10851FC924A # 0.169465103180 189
+ .quad 0x03FC5B10851FC924A
+ .quad 0x03FC5C8BC079D8289 # 0.170188430518 190
+ .quad 0x03FC5C8BC079D8289
+ .quad 0x03FC5E533144C1718 # 0.171057114516 191
+ .quad 0x03FC5E533144C1718
+ .quad 0x03FC601B076E7A8A8 # 0.171926553783 192
+ .quad 0x03FC601B076E7A8A8
+ .quad 0x03FC619732215D786 # 0.172651664394 193
+ .quad 0x03FC619732215D786
+ .quad 0x03FC635FC298F6C77 # 0.173522491735 194
+ .quad 0x03FC635FC298F6C77
+ .quad 0x03FC6528B8EFA5D16 # 0.174394078077 195
+ .quad 0x03FC6528B8EFA5D16
+ .quad 0x03FC66A5D42A3AD33 # 0.175120980777 196
+ .quad 0x03FC66A5D42A3AD33
+ .quad 0x03FC686F85BAD4298 # 0.175993962063 197
+ .quad 0x03FC686F85BAD4298
+ .quad 0x03FC6A399DABBD383 # 0.176867706111 198
+ .quad 0x03FC6A399DABBD383
+ .quad 0x03FC6BB7AA9F22C40 # 0.177596409780 199
+ .quad 0x03FC6BB7AA9F22C40
+ .quad 0x03FC6D827EB7C1E57 # 0.178471555693 200
+ .quad 0x03FC6D827EB7C1E57
+ .quad 0x03FC6F0128B756AB9 # 0.179201429458 201
+ .quad 0x03FC6F0128B756AB9
+ .quad 0x03FC70CCB9927BCF6 # 0.180077981742 202
+ .quad 0x03FC70CCB9927BCF6
+ .quad 0x03FC7298B1A4E32B6 # 0.180955303044 203
+ .quad 0x03FC7298B1A4E32B6
+ .quad 0x03FC74184F58CC7DC # 0.181686992547 204
+ .quad 0x03FC74184F58CC7DC
+ .quad 0x03FC75E5051E74141 # 0.182565727226 205
+ .quad 0x03FC75E5051E74141
+ .quad 0x03FC77654128F6127 # 0.183298596442 206
+ .quad 0x03FC77654128F6127
+ .quad 0x03FC7932B53E97639 # 0.184178749058 207
+ .quad 0x03FC7932B53E97639
+ .quad 0x03FC7AB390229D8FD # 0.184912801796 208
+ .quad 0x03FC7AB390229D8FD
+ .quad 0x03FC7C81C325B4A5E # 0.185794376934 209
+ .quad 0x03FC7C81C325B4A5E
+ .quad 0x03FC7E033D66CD24A # 0.186529617023 210
+ .quad 0x03FC7E033D66CD24A
+ .quad 0x03FC7FD22FF599D4C # 0.187412619288 211
+ .quad 0x03FC7FD22FF599D4C
+ .quad 0x03FC81544A17F67C1 # 0.188149050576 212
+ .quad 0x03FC81544A17F67C1
+ .quad 0x03FC8323FCD17DAC8 # 0.189033484595 213
+ .quad 0x03FC8323FCD17DAC8
+ .quad 0x03FC84A6B759F512D # 0.189771110947 214
+ .quad 0x03FC84A6B759F512D
+ .quad 0x03FC86772ADE0201C # 0.190656981373 215
+ .quad 0x03FC86772ADE0201C
+ .quad 0x03FC87FA865210911 # 0.191395806674 216
+ .quad 0x03FC87FA865210911
+ .quad 0x03FC89CBBB4136201 # 0.192283118179 217
+ .quad 0x03FC89CBBB4136201
+ .quad 0x03FC8B4FB826FF291 # 0.193023146334 218
+ .quad 0x03FC8B4FB826FF291
+ .quad 0x03FC8D21AF2299298 # 0.193911903613 219
+ .quad 0x03FC8D21AF2299298
+ .quad 0x03FC8EA64E00E7FC0 # 0.194653138545 220
+ .quad 0x03FC8EA64E00E7FC0
+ .quad 0x03FC902B36AB7681D # 0.195394923313 221
+ .quad 0x03FC902B36AB7681D
+ .quad 0x03FC91FE49096581E # 0.196285791969 222
+ .quad 0x03FC91FE49096581E
+ .quad 0x03FC9383D471B869B # 0.197028789254 223
+ .quad 0x03FC9383D471B869B
+ .quad 0x03FC9557AA6B87F65 # 0.197921115309 224
+ .quad 0x03FC9557AA6B87F65
+ .quad 0x03FC96DDD91A0B959 # 0.198665329082 225
+ .quad 0x03FC96DDD91A0B959
+ .quad 0x03FC9864522D04491 # 0.199410097121 226
+ .quad 0x03FC9864522D04491
+ .quad 0x03FC9A3945D1A44B3 # 0.200304551564 227
+ .quad 0x03FC9A3945D1A44B3
+ .quad 0x03FC9BC062F26FC3B # 0.201050541900 228
+ .quad 0x03FC9BC062F26FC3B
+ .quad 0x03FC9D47CAD2C1871 # 0.201797089154 229
+ .quad 0x03FC9D47CAD2C1871
+ .quad 0x03FC9F1DDD7FE4F8B # 0.202693682161 230
+ .quad 0x03FC9F1DDD7FE4F8B
+ .quad 0x03FCA0A5EA371A910 # 0.203441457564 231
+ .quad 0x03FCA0A5EA371A910
+ .quad 0x03FCA22E42098F498 # 0.204189792554 232
+ .quad 0x03FCA22E42098F498
+ .quad 0x03FCA405751F6CCE4 # 0.205088534376 233
+ .quad 0x03FCA405751F6CCE4
+ .quad 0x03FCA58E729348F40 # 0.205838103409 234
+ .quad 0x03FCA58E729348F40
+ .quad 0x03FCA717BB7EC64A3 # 0.206588234717 235
+ .quad 0x03FCA717BB7EC64A3
+ .quad 0x03FCA8F010601E5FD # 0.207489135679 236
+ .quad 0x03FCA8F010601E5FD
+ .quad 0x03FCAA79FFB8FCD48 # 0.208240506966 237
+ .quad 0x03FCAA79FFB8FCD48
+ .quad 0x03FCAC043AE68965A # 0.208992443238 238
+ .quad 0x03FCAC043AE68965A
+ .quad 0x03FCAD8EC205FB6AD # 0.209744945343 239
+ .quad 0x03FCAD8EC205FB6AD
+ .quad 0x03FCAF6895610DBAD # 0.210648695969 240
+ .quad 0x03FCAF6895610DBAD
+ .quad 0x03FCB0F3C3FBD65C9 # 0.211402445910 241
+ .quad 0x03FCB0F3C3FBD65C9
+ .quad 0x03FCB27F3EE674219 # 0.212156764419 242
+ .quad 0x03FCB27F3EE674219
+ .quad 0x03FCB40B063E65B0F # 0.212911652354 243
+ .quad 0x03FCB40B063E65B0F
+ .quad 0x03FCB5E65A8096C88 # 0.213818270730 244
+ .quad 0x03FCB5E65A8096C88
+ .quad 0x03FCB772CA646760C # 0.214574414434 245
+ .quad 0x03FCB772CA646760C
+ .quad 0x03FCB8FF871461198 # 0.215331130323 246
+ .quad 0x03FCB8FF871461198
+ .quad 0x03FCBA8C90AE4AD19 # 0.216088419265 247
+ .quad 0x03FCBA8C90AE4AD19
+ .quad 0x03FCBC19E74FFCBDA # 0.216846282128 248
+ .quad 0x03FCBC19E74FFCBDA
+ .quad 0x03FCBDF71B83DAE7A # 0.217756476365 249
+ .quad 0x03FCBDF71B83DAE7A
+ .quad 0x03FCBF851C067555C # 0.218515604922 250
+ .quad 0x03FCBF851C067555C
+ .quad 0x03FCC11369F0CDB3C # 0.219275310193 251
+ .quad 0x03FCC11369F0CDB3C
+ .quad 0x03FCC2A205610593E # 0.220035593055 252
+ .quad 0x03FCC2A205610593E
+ .quad 0x03FCC430EE755023B # 0.220796454387 253
+ .quad 0x03FCC430EE755023B
+ .quad 0x03FCC5C0254BF23A8 # 0.221557895069 254
+ .quad 0x03FCC5C0254BF23A8
+ .quad 0x03FCC79F9AB632BF1 # 0.222472389875 255
+ .quad 0x03FCC79F9AB632BF1
+ .quad 0x03FCC92F7D09ABE20 # 0.223235108240 256
+ .quad 0x03FCC92F7D09ABE20
+ .quad 0x03FCCABFAD80D023D # 0.223998408788 257
+ .quad 0x03FCCABFAD80D023D
+ .quad 0x03FCCC502C3A2F1E8 # 0.224762292410 258
+ .quad 0x03FCCC502C3A2F1E8
+ .quad 0x03FCCDE0F9546A5E7 # 0.225526759995 259
+ .quad 0x03FCCDE0F9546A5E7
+ .quad 0x03FCCF7214EE356E9 # 0.226291812439 260
+ .quad 0x03FCCF7214EE356E9
+ .quad 0x03FCD1037F2655E7B # 0.227057450635 261
+ .quad 0x03FCD1037F2655E7B
+ .quad 0x03FCD295381BA37E9 # 0.227823675483 262
+ .quad 0x03FCD295381BA37E9
+ .quad 0x03FCD4273FED08111 # 0.228590487882 263
+ .quad 0x03FCD4273FED08111
+ .quad 0x03FCD5B996B97FB5F # 0.229357888733 264
+ .quad 0x03FCD5B996B97FB5F
+ .quad 0x03FCD74C3CA018C9C # 0.230125878940 265
+ .quad 0x03FCD74C3CA018C9C
+ .quad 0x03FCD8DF31BFF3FF2 # 0.230894459410 266
+ .quad 0x03FCD8DF31BFF3FF2
+ .quad 0x03FCDA727638446A1 # 0.231663631050 267
+ .quad 0x03FCDA727638446A1
+ .quad 0x03FCDC56CAE452F5B # 0.232587418645 268
+ .quad 0x03FCDC56CAE452F5B
+ .quad 0x03FCDDEABE5A3926E # 0.233357894066 269
+ .quad 0x03FCDDEABE5A3926E
+ .quad 0x03FCDF7F018CE771F # 0.234128963578 270
+ .quad 0x03FCDF7F018CE771F
+ .quad 0x03FCE113949BDEC62 # 0.234900628096 271
+ .quad 0x03FCE113949BDEC62
+ .quad 0x03FCE2A877A6B2C0F # 0.235672888541 272
+ .quad 0x03FCE2A877A6B2C0F
+ .quad 0x03FCE43DAACD09BEC # 0.236445745833 273
+ .quad 0x03FCE43DAACD09BEC
+ .quad 0x03FCE5D32E2E9CE87 # 0.237219200895 274
+ .quad 0x03FCE5D32E2E9CE87
+ .quad 0x03FCE76901EB38427 # 0.237993254653 275
+ .quad 0x03FCE76901EB38427
+ .quad 0x03FCE8ADE53F76866 # 0.238612929343 276
+ .quad 0x03FCE8ADE53F76866
+ .quad 0x03FCEA4449F04AAF4 # 0.239388063093 277
+ .quad 0x03FCEA4449F04AAF4
+ .quad 0x03FCEBDAFF5593E99 # 0.240163798141 278
+ .quad 0x03FCEBDAFF5593E99
+ .quad 0x03FCED72058F666C5 # 0.240940135421 279
+ .quad 0x03FCED72058F666C5
+ .quad 0x03FCEF095CBDE9937 # 0.241717075868 280
+ .quad 0x03FCEF095CBDE9937
+ .quad 0x03FCF0A1050157ED6 # 0.242494620422 281
+ .quad 0x03FCF0A1050157ED6
+ .quad 0x03FCF238FE79FF4BF # 0.243272770021 282
+ .quad 0x03FCF238FE79FF4BF
+ .quad 0x03FCF3D1494840D2F # 0.244051525609 283
+ .quad 0x03FCF3D1494840D2F
+ .quad 0x03FCF569E58C91077 # 0.244830888130 284
+ .quad 0x03FCF569E58C91077
+ .quad 0x03FCF702D36777DF0 # 0.245610858531 285
+ .quad 0x03FCF702D36777DF0
+ .quad 0x03FCF89C12F990D0C # 0.246391437760 286
+ .quad 0x03FCF89C12F990D0C
+ .quad 0x03FCFA35A4638AE2C # 0.247172626770 287
+ .quad 0x03FCFA35A4638AE2C
+ .quad 0x03FCFB7D86EEE3B92 # 0.247798017660 288
+ .quad 0x03FCFB7D86EEE3B92
+ .quad 0x03FCFD17ABFCDB683 # 0.248580306677 289
+ .quad 0x03FCFD17ABFCDB683
+ .quad 0x03FCFEB2233EA07CB # 0.249363208150 290
+ .quad 0x03FCFEB2233EA07CB
+ .quad 0x03FD0026766A9671C # 0.250146723037 291
+ .quad 0x03FD0026766A9671C
+ .quad 0x03FD00F40470C7323 # 0.250930852302 292
+ .quad 0x03FD00F40470C7323
+ .quad 0x03FD01C1BBC2735A3 # 0.251715596908 293
+ .quad 0x03FD01C1BBC2735A3
+ .quad 0x03FD028F9C7035C1D # 0.252500957822 294
+ .quad 0x03FD028F9C7035C1D
+ .quad 0x03FD03346E0106062 # 0.253129690945 295
+ .quad 0x03FD03346E0106062
+ .quad 0x03FD0402994B4F041 # 0.253916163656 296
+ .quad 0x03FD0402994B4F041
+ .quad 0x03FD04D0EE20620AF # 0.254703255393 297
+ .quad 0x03FD04D0EE20620AF
+ .quad 0x03FD059F6C910034D # 0.255490967131 298
+ .quad 0x03FD059F6C910034D
+ .quad 0x03FD066E14ADF4BFD # 0.256279299848 299
+ .quad 0x03FD066E14ADF4BFD
+ .quad 0x03FD07138604D5864 # 0.256910413785 300
+ .quad 0x03FD07138604D5864
+ .quad 0x03FD07E2794F3E8C1 # 0.257699866735 301
+ .quad 0x03FD07E2794F3E8C1
+ .quad 0x03FD08B196753A125 # 0.258489943414 302
+ .quad 0x03FD08B196753A125
+ .quad 0x03FD0980DD87BA2DD # 0.259280644807 303
+ .quad 0x03FD0980DD87BA2DD
+ .quad 0x03FD0A504E97BB40C # 0.260071971904 304
+ .quad 0x03FD0A504E97BB40C
+ .quad 0x03FD0AF660EB9E278 # 0.260705484754 305
+ .quad 0x03FD0AF660EB9E278
+ .quad 0x03FD0BC61DBBA97CB # 0.261497940616 306
+ .quad 0x03FD0BC61DBBA97CB
+ .quad 0x03FD0C9604B8FC51E # 0.262291024962 307
+ .quad 0x03FD0C9604B8FC51E
+ .quad 0x03FD0D3C7586CD5E5 # 0.262925945618 308
+ .quad 0x03FD0D3C7586CD5E5
+ .quad 0x03FD0E0CA89A72D29 # 0.263720163752 309
+ .quad 0x03FD0E0CA89A72D29
+ .quad 0x03FD0EDD060B78082 # 0.264515013170 310
+ .quad 0x03FD0EDD060B78082
+ .quad 0x03FD0FAD8DEB1E2C0 # 0.265310494876 311
+ .quad 0x03FD0FAD8DEB1E2C0
+ .quad 0x03FD10547F9D26ABC # 0.265947336165 312
+ .quad 0x03FD10547F9D26ABC
+ .quad 0x03FD1125540925114 # 0.266743958529 313
+ .quad 0x03FD1125540925114
+ .quad 0x03FD11F653144CB8B # 0.267541216005 314
+ .quad 0x03FD11F653144CB8B
+ .quad 0x03FD129DA43F5BE9E # 0.268179479949 315
+ .quad 0x03FD129DA43F5BE9E
+ .quad 0x03FD136EF02E8290C # 0.268977883185 316
+ .quad 0x03FD136EF02E8290C
+ .quad 0x03FD144066EDAE406 # 0.269776924378 317
+ .quad 0x03FD144066EDAE406
+ .quad 0x03FD14E817FF359D7 # 0.270416617347 318
+ .quad 0x03FD14E817FF359D7
+ .quad 0x03FD15B9DBFA9DEC8 # 0.271216809436 319
+ .quad 0x03FD15B9DBFA9DEC8
+ .quad 0x03FD168BCAF73B3EB # 0.272017642345 320
+ .quad 0x03FD168BCAF73B3EB
+ .quad 0x03FD1733DC5D68DE8 # 0.272658770753 321
+ .quad 0x03FD1733DC5D68DE8
+ .quad 0x03FD180618EF18ADE # 0.273460759729 322
+ .quad 0x03FD180618EF18ADE
+ .quad 0x03FD18D880B3826FE # 0.274263392407 323
+ .quad 0x03FD18D880B3826FE
+ .quad 0x03FD1980F2DD42B6F # 0.274905962710 324
+ .quad 0x03FD1980F2DD42B6F
+ .quad 0x03FD1A53A8902E70B # 0.275709756661 325
+ .quad 0x03FD1A53A8902E70B
+ .quad 0x03FD1AFC59297024D # 0.276353257326 326
+ .quad 0x03FD1AFC59297024D
+ .quad 0x03FD1BCF5D04AE1EA # 0.277158215914 327
+ .quad 0x03FD1BCF5D04AE1EA
+ .quad 0x03FD1CA28C64BAE54 # 0.277963822983 328
+ .quad 0x03FD1CA28C64BAE54
+ .quad 0x03FD1D4B9E796C245 # 0.278608776246 329
+ .quad 0x03FD1D4B9E796C245
+ .quad 0x03FD1E1F1C5C3A06C # 0.279415553216 330
+ .quad 0x03FD1E1F1C5C3A06C
+ .quad 0x03FD1EC86D5747AAD # 0.280061443760 331
+ .quad 0x03FD1EC86D5747AAD
+ .quad 0x03FD1F9C39F74C559 # 0.280869394034 332
+ .quad 0x03FD1F9C39F74C559
+ .quad 0x03FD2070326F1F789 # 0.281677997620 333
+ .quad 0x03FD2070326F1F789
+ .quad 0x03FD2119E59F8789C # 0.282325351583 334
+ .quad 0x03FD2119E59F8789C
+ .quad 0x03FD21EE2D300381C # 0.283135133796 335
+ .quad 0x03FD21EE2D300381C
+ .quad 0x03FD22981FBEF797A # 0.283783432036 336
+ .quad 0x03FD22981FBEF797A
+ .quad 0x03FD236CB6A339EED # 0.284594396317 337
+ .quad 0x03FD236CB6A339EED
+ .quad 0x03FD2416E8C01F606 # 0.285243641592 338
+ .quad 0x03FD2416E8C01F606
+ .quad 0x03FD24EBCF3387FF6 # 0.286055791397 339
+ .quad 0x03FD24EBCF3387FF6
+ .quad 0x03FD2596410DF963A # 0.286705986479 340
+ .quad 0x03FD2596410DF963A
+ .quad 0x03FD266B774C2AF55 # 0.287519325279 341
+ .quad 0x03FD266B774C2AF55
+ .quad 0x03FD27162913F873F # 0.288170472950 342
+ .quad 0x03FD27162913F873F
+ .quad 0x03FD27EBAF58D8C9C # 0.288985004232 343
+ .quad 0x03FD27EBAF58D8C9C
+ .quad 0x03FD2896A13E086A3 # 0.289637107288 344
+ .quad 0x03FD2896A13E086A3
+ .quad 0x03FD296C77C5C0E13 # 0.290452834554 345
+ .quad 0x03FD296C77C5C0E13
+ .quad 0x03FD2A17A9F88EDD2 # 0.291105895801 346
+ .quad 0x03FD2A17A9F88EDD2
+ .quad 0x03FD2AEDD0FF8CC2C # 0.291922822568 347
+ .quad 0x03FD2AEDD0FF8CC2C
+ .quad 0x03FD2B9943B06BD77 # 0.292576844829 348
+ .quad 0x03FD2B9943B06BD77
+ .quad 0x03FD2C6FBB7360D0E # 0.293394974630 349
+ .quad 0x03FD2C6FBB7360D0E
+ .quad 0x03FD2D1B6ED2FA90C # 0.294049960734 350
+ .quad 0x03FD2D1B6ED2FA90C
+ .quad 0x03FD2DC73F01B0DD4 # 0.294705376127 351
+ .quad 0x03FD2DC73F01B0DD4
+ .quad 0x03FD2E9E2BCE12286 # 0.295525249913 352
+ .quad 0x03FD2E9E2BCE12286
+ .quad 0x03FD2F4A3CF22EDC2 # 0.296181633264 353
+ .quad 0x03FD2F4A3CF22EDC2
+ .quad 0x03FD30217B1006601 # 0.297002718785 354
+ .quad 0x03FD30217B1006601
+ .quad 0x03FD30CDCD5ABA762 # 0.297660072959 355
+ .quad 0x03FD30CDCD5ABA762
+ .quad 0x03FD31A55D07A8590 # 0.298482373803 356
+ .quad 0x03FD31A55D07A8590
+ .quad 0x03FD3251F0AA5CC1A # 0.299140701674 357
+ .quad 0x03FD3251F0AA5CC1A
+ .quad 0x03FD32FEA167A6D70 # 0.299799463226 358
+ .quad 0x03FD32FEA167A6D70
+ .quad 0x03FD33D6A7509D491 # 0.300623525901 359
+ .quad 0x03FD33D6A7509D491
+ .quad 0x03FD348399ADA9D94 # 0.301283265328 360
+ .quad 0x03FD348399ADA9D94
+ .quad 0x03FD3530A9454ADC9 # 0.301943440298 361
+ .quad 0x03FD3530A9454ADC9
+ .quad 0x03FD360925EC44F5C # 0.302769272371 362
+ .quad 0x03FD360925EC44F5C
+ .quad 0x03FD36B6776BE1116 # 0.303430429420 363
+ .quad 0x03FD36B6776BE1116
+ .quad 0x03FD378F469437FB4 # 0.304257490918 364
+ .quad 0x03FD378F469437FB4
+ .quad 0x03FD383CDA2E14ECB # 0.304919632971 365
+ .quad 0x03FD383CDA2E14ECB
+ .quad 0x03FD38EA8B3924521 # 0.305582213748 366
+ .quad 0x03FD38EA8B3924521
+ .quad 0x03FD39C3D1FD60E74 # 0.306411057558 367
+ .quad 0x03FD39C3D1FD60E74
+ .quad 0x03FD3A71C56BB48C7 # 0.307074627589 368
+ .quad 0x03FD3A71C56BB48C7
+ .quad 0x03FD3B1FD66BC8D10 # 0.307738638238 369
+ .quad 0x03FD3B1FD66BC8D10
+ .quad 0x03FD3BF995502CB5C # 0.308569272059 370
+ .quad 0x03FD3BF995502CB5C
+ .quad 0x03FD3CA7E8FD01DF6 # 0.309234276240 371
+ .quad 0x03FD3CA7E8FD01DF6
+ .quad 0x03FD3D565A5C5BF11 # 0.309899722945 372
+ .quad 0x03FD3D565A5C5BF11
+ .quad 0x03FD3E3091E6049FB # 0.310732154526 373
+ .quad 0x03FD3E3091E6049FB
+ .quad 0x03FD3EDF463C1683E # 0.311398599069 374
+ .quad 0x03FD3EDF463C1683E
+ .quad 0x03FD3F8E1865A82DD # 0.312065488057 375
+ .quad 0x03FD3F8E1865A82DD
+ .quad 0x03FD403D086CEA79B # 0.312732822082 376
+ .quad 0x03FD403D086CEA79B
+ .quad 0x03FD4117DE854CA15 # 0.313567616354 377
+ .quad 0x03FD4117DE854CA15
+ .quad 0x03FD41C711E4BA15E # 0.314235953889 378
+ .quad 0x03FD41C711E4BA15E
+ .quad 0x03FD427663431B221 # 0.314904738398 379
+ .quad 0x03FD427663431B221
+ .quad 0x03FD4325D2AAB6F18 # 0.315573970480 380
+ .quad 0x03FD4325D2AAB6F18
+ .quad 0x03FD44014838E5513 # 0.316411140893 381
+ .quad 0x03FD44014838E5513
+ .quad 0x03FD44B0FB5AF4F44 # 0.317081382205 382
+ .quad 0x03FD44B0FB5AF4F44
+ .quad 0x03FD4560CCA7CB3B2 # 0.317752073041 383
+ .quad 0x03FD4560CCA7CB3B2
+ .quad 0x03FD4610BC29C5E18 # 0.318423214006 384
+ .quad 0x03FD4610BC29C5E18
+ .quad 0x03FD46ECD216CDCB5 # 0.319262774126 385
+ .quad 0x03FD46ECD216CDCB5
+ .quad 0x03FD479D05B65CB60 # 0.319934930091 386
+ .quad 0x03FD479D05B65CB60
+ .quad 0x03FD484D57ACE5A1A # 0.320607538154 387
+ .quad 0x03FD484D57ACE5A1A
+ .quad 0x03FD48FDC804DD1CB # 0.321280598924 388
+ .quad 0x03FD48FDC804DD1CB
+ .quad 0x03FD49DA7F3BCC420 # 0.322122562432 389
+ .quad 0x03FD49DA7F3BCC420
+ .quad 0x03FD4A8B341552B09 # 0.322796644021 390
+ .quad 0x03FD4A8B341552B09
+ .quad 0x03FD4B3C077267E9A # 0.323471180303 391
+ .quad 0x03FD4B3C077267E9A
+ .quad 0x03FD4BECF95D97914 # 0.324146171892 392
+ .quad 0x03FD4BECF95D97914
+ .quad 0x03FD4C9E09E172C3D # 0.324821619401 393
+ .quad 0x03FD4C9E09E172C3D
+ .quad 0x03FD4D4F3908901A0 # 0.325497523449 394
+ .quad 0x03FD4D4F3908901A0
+ .quad 0x03FD4E2CDF1F341C1 # 0.326343046455 395
+ .quad 0x03FD4E2CDF1F341C1
+ .quad 0x03FD4EDE535C79642 # 0.327019979972 396
+ .quad 0x03FD4EDE535C79642
+ .quad 0x03FD4F8FE65F90500 # 0.327697372039 397
+ .quad 0x03FD4F8FE65F90500
+ .quad 0x03FD5041983326F2D # 0.328375223276 398
+ .quad 0x03FD5041983326F2D
+ .quad 0x03FD50F368E1F0F02 # 0.329053534308 399
+ .quad 0x03FD50F368E1F0F02
+ .quad 0x03FD51A55876A77F5 # 0.329732305758 400
+ .quad 0x03FD51A55876A77F5
+ .quad 0x03FD5283EF743F98B # 0.330581418486 401
+ .quad 0x03FD5283EF743F98B
+ .quad 0x03FD533624B59CA35 # 0.331261228165 402
+ .quad 0x03FD533624B59CA35
+ .quad 0x03FD53E878FFE6EAE # 0.331941500300 403
+ .quad 0x03FD53E878FFE6EAE
+ .quad 0x03FD549AEC5DEF880 # 0.332622235521 404
+ .quad 0x03FD549AEC5DEF880
+ .quad 0x03FD554D7EDA8D3C4 # 0.333303434457 405
+ .quad 0x03FD554D7EDA8D3C4
+ .quad 0x03FD560030809C759 # 0.333985097742 406
+ .quad 0x03FD560030809C759
+ .quad 0x03FD56B3015AFF52C # 0.334667226008 407
+ .quad 0x03FD56B3015AFF52C
+ .quad 0x03FD5765F1749DA6C # 0.335349819892 408
+ .quad 0x03FD5765F1749DA6C
+ .quad 0x03FD581900D864FD7 # 0.336032880027 409
+ .quad 0x03FD581900D864FD7
+ .quad 0x03FD58CC2F91489F5 # 0.336716407053 410
+ .quad 0x03FD58CC2F91489F5
+ .quad 0x03FD59AC5618CCE38 # 0.337571473373 411
+ .quad 0x03FD59AC5618CCE38
+ .quad 0x03FD5A5FCB795780C # 0.338256053239 412
+ .quad 0x03FD5A5FCB795780C
+ .quad 0x03FD5B136052BCE39 # 0.338941102075 413
+ .quad 0x03FD5B136052BCE39
+ .quad 0x03FD5BC714B008E23 # 0.339626620526 414
+ .quad 0x03FD5BC714B008E23
+ .quad 0x03FD5C7AE89C4D254 # 0.340312609234 415
+ .quad 0x03FD5C7AE89C4D254
+ .quad 0x03FD5D2EDC22A12BA # 0.340999068845 416
+ .quad 0x03FD5D2EDC22A12BA
+ .quad 0x03FD5DE2EF4E224D6 # 0.341686000008 417
+ .quad 0x03FD5DE2EF4E224D6
+ .quad 0x03FD5E972229F3C15 # 0.342373403369 418
+ .quad 0x03FD5E972229F3C15
+ .quad 0x03FD5F4B74C13EA04 # 0.343061279578 419
+ .quad 0x03FD5F4B74C13EA04
+ .quad 0x03FD5FFFE71F31E9A # 0.343749629287 420
+ .quad 0x03FD5FFFE71F31E9A
+ .quad 0x03FD60B4794F02875 # 0.344438453147 421
+ .quad 0x03FD60B4794F02875
+ .quad 0x03FD61692B5BEB520 # 0.345127751813 422
+ .quad 0x03FD61692B5BEB520
+ .quad 0x03FD621DFD512D14F # 0.345817525940 423
+ .quad 0x03FD621DFD512D14F
+ .quad 0x03FD62D2EF3A0E933 # 0.346507776183 424
+ .quad 0x03FD62D2EF3A0E933
+ .quad 0x03FD63880121DC8AB # 0.347198503200 425
+ .quad 0x03FD63880121DC8AB
+ .quad 0x03FD643D3313E9B92 # 0.347889707652 426
+ .quad 0x03FD643D3313E9B92
+ .quad 0x03FD64F2851B8EE01 # 0.348581390197 427
+ .quad 0x03FD64F2851B8EE01
+ .quad 0x03FD65A7F7442AC90 # 0.349273551498 428
+ .quad 0x03FD65A7F7442AC90
+ .quad 0x03FD665D8999224A5 # 0.349966192218 429
+ .quad 0x03FD665D8999224A5
+ .quad 0x03FD67133C25E04A5 # 0.350659313022 430
+ .quad 0x03FD67133C25E04A5
+ .quad 0x03FD67C90EF5D5C4C # 0.351352914576 431
+ .quad 0x03FD67C90EF5D5C4C
+ .quad 0x03FD687F021479CEE # 0.352046997547 432
+ .quad 0x03FD687F021479CEE
+ .quad 0x03FD6935158D499B3 # 0.352741562603 433
+ .quad 0x03FD6935158D499B3
+ .quad 0x03FD69EB496BC87E5 # 0.353436610416 434
+ .quad 0x03FD69EB496BC87E5
+ .quad 0x03FD6AA19DBB7FF34 # 0.354132141656 435
+ .quad 0x03FD6AA19DBB7FF34
+ .quad 0x03FD6B581287FF9FD # 0.354828156996 436
+ .quad 0x03FD6B581287FF9FD
+ .quad 0x03FD6C0EA7DCDD591 # 0.355524657112 437
+ .quad 0x03FD6C0EA7DCDD591
+ .quad 0x03FD6C97AD3CFCFD9 # 0.356047350738 438
+ .quad 0x03FD6C97AD3CFCFD9
+ .quad 0x03FD6D4E7B9C727EC # 0.356744700836 439
+ .quad 0x03FD6D4E7B9C727EC
+ .quad 0x03FD6E056AA4421D6 # 0.357442537571 440
+ .quad 0x03FD6E056AA4421D6
+ .quad 0x03FD6EBC7A6019066 # 0.358140861621 441
+ .quad 0x03FD6EBC7A6019066
+ .quad 0x03FD6F73AADBAAAB7 # 0.358839673669 442
+ .quad 0x03FD6F73AADBAAAB7
+ .quad 0x03FD702AFC22B0C6D # 0.359538974397 443
+ .quad 0x03FD702AFC22B0C6D
+ .quad 0x03FD70E26E40EB5FA # 0.360238764489 444
+ .quad 0x03FD70E26E40EB5FA
+ .quad 0x03FD719A014220CF5 # 0.360939044629 445
+ .quad 0x03FD719A014220CF5
+ .quad 0x03FD7251B5321DC54 # 0.361639815506 446
+ .quad 0x03FD7251B5321DC54
+ .quad 0x03FD73098A1CB54BA # 0.362341077807 447
+ .quad 0x03FD73098A1CB54BA
+ .quad 0x03FD73937F783CEBA # 0.362867347444 448
+ .quad 0x03FD73937F783CEBA
+ .quad 0x03FD744B8E35E9EDA # 0.363569471398 449
+ .quad 0x03FD744B8E35E9EDA
+ .quad 0x03FD7503BE0ED6C66 # 0.364272088676 450
+ .quad 0x03FD7503BE0ED6C66
+ .quad 0x03FD75BC0F0EEE7DE # 0.364975199972 451
+ .quad 0x03FD75BC0F0EEE7DE
+ .quad 0x03FD76748142228C7 # 0.365678805982 452
+ .quad 0x03FD76748142228C7
+ .quad 0x03FD772D14B46AE00 # 0.366382907402 453
+ .quad 0x03FD772D14B46AE00
+ .quad 0x03FD77E5C971C5E06 # 0.367087504930 454
+ .quad 0x03FD77E5C971C5E06
+ .quad 0x03FD787066E04915F # 0.367616279067 455
+ .quad 0x03FD787066E04915F
+ .quad 0x03FD792955FDF47A3 # 0.368321746469 456
+ .quad 0x03FD792955FDF47A3
+ .quad 0x03FD79E26687CFB3D # 0.369027711906 457
+ .quad 0x03FD79E26687CFB3D
+ .quad 0x03FD7A9B9889F19E2 # 0.369734176082 458
+ .quad 0x03FD7A9B9889F19E2
+ .quad 0x03FD7B54EC1077A48 # 0.370441139703 459
+ .quad 0x03FD7B54EC1077A48
+ .quad 0x03FD7C0E612785C74 # 0.371148603475 460
+ .quad 0x03FD7C0E612785C74
+ .quad 0x03FD7C998F06FB152 # 0.371679529954 461
+ .quad 0x03FD7C998F06FB152
+ .quad 0x03FD7D533EF841E8A # 0.372387870696 462
+ .quad 0x03FD7D533EF841E8A
+ .quad 0x03FD7E0D109B95F19 # 0.373096713539 463
+ .quad 0x03FD7E0D109B95F19
+ .quad 0x03FD7EC703FD340AA # 0.373806059198 464
+ .quad 0x03FD7EC703FD340AA
+ .quad 0x03FD7F8119295FB9B # 0.374515908385 465
+ .quad 0x03FD7F8119295FB9B
+ .quad 0x03FD800CBF3ED1CC2 # 0.375048626146 466
+ .quad 0x03FD800CBF3ED1CC2
+ .quad 0x03FD80C70FAB0BDF6 # 0.375759358229 467
+ .quad 0x03FD80C70FAB0BDF6
+ .quad 0x03FD81818203AFC7F # 0.376470595813 468
+ .quad 0x03FD81818203AFC7F
+ .quad 0x03FD823C16551A3C3 # 0.377182339615 469
+ .quad 0x03FD823C16551A3C3
+ .quad 0x03FD82C81BE4DFF4A # 0.377716480107 470
+ .quad 0x03FD82C81BE4DFF4A
+ .quad 0x03FD8382EBC7794D1 # 0.378429111528 471
+ .quad 0x03FD8382EBC7794D1
+ .quad 0x03FD843DDDC4FB137 # 0.379142251156 472
+ .quad 0x03FD843DDDC4FB137
+ .quad 0x03FD84F8F1E9DB72B # 0.379855899714 473
+ .quad 0x03FD84F8F1E9DB72B
+ .quad 0x03FD85855776DCBFB # 0.380391470556 474
+ .quad 0x03FD85855776DCBFB
+ .quad 0x03FD8640A77EB3957 # 0.381106011494 475
+ .quad 0x03FD8640A77EB3957
+ .quad 0x03FD86FC19D05148E # 0.381821063366 476
+ .quad 0x03FD86FC19D05148E
+ .quad 0x03FD87B7AE7845C0F # 0.382536626902 477
+ .quad 0x03FD87B7AE7845C0F
+ .quad 0x03FD8844748678822 # 0.383073635776 478
+ .quad 0x03FD8844748678822
+ .quad 0x03FD89004563D3DFD # 0.383790096491 479
+ .quad 0x03FD89004563D3DFD
+ .quad 0x03FD89BC38BA356B4 # 0.384507070890 480
+ .quad 0x03FD89BC38BA356B4
+ .quad 0x03FD8A4945E20894E # 0.385045139237 481
+ .quad 0x03FD8A4945E20894E
+ .quad 0x03FD8B0575AAB1FC5 # 0.385763014358 482
+ .quad 0x03FD8B0575AAB1FC5
+ .quad 0x03FD8BC1C80F45A32 # 0.386481405193 483
+ .quad 0x03FD8BC1C80F45A32
+ .quad 0x03FD8C7E3D1C80B2F # 0.387200312485 484
+ .quad 0x03FD8C7E3D1C80B2F
+ .quad 0x03FD8D0BABACC89EE # 0.387739832326 485
+ .quad 0x03FD8D0BABACC89EE
+ .quad 0x03FD8DC85D7FE5013 # 0.388459645206 486
+ .quad 0x03FD8DC85D7FE5013
+ .quad 0x03FD8E85321ED5598 # 0.389179976589 487
+ .quad 0x03FD8E85321ED5598
+ .quad 0x03FD8F12E873862C7 # 0.389720565845 488
+ .quad 0x03FD8F12E873862C7
+ .quad 0x03FD8FCFFA1614AA0 # 0.390441806410 489
+ .quad 0x03FD8FCFFA1614AA0
+ .quad 0x03FD908D2EA7D9511 # 0.391163567538 490
+ .quad 0x03FD908D2EA7D9511
+ .quad 0x03FD911B2D09ED9D6 # 0.391705230456 491
+ .quad 0x03FD911B2D09ED9D6
+ .quad 0x03FD91D89EDD6B7FF # 0.392427904381 492
+ .quad 0x03FD91D89EDD6B7FF
+ .quad 0x03FD929633C3B7D3E # 0.393151100941 493
+ .quad 0x03FD929633C3B7D3E
+ .quad 0x03FD93247A7C99B52 # 0.393693841796 494
+ .quad 0x03FD93247A7C99B52
+ .quad 0x03FD93E24CE3195E8 # 0.394417954789 495
+ .quad 0x03FD93E24CE3195E8
+ .quad 0x03FD9470C1CB1962E # 0.394961383840 496
+ .quad 0x03FD9470C1CB1962E
+ .quad 0x03FD952ED1D9C0435 # 0.395686415592 497
+ .quad 0x03FD952ED1D9C0435
+ .quad 0x03FD95ED0535EA5D9 # 0.396411973396 498
+ .quad 0x03FD95ED0535EA5D9
+ .quad 0x03FD967BC2EDCCE17 # 0.396956487431 499
+ .quad 0x03FD967BC2EDCCE17
+ .quad 0x03FD973A3431356AE # 0.397682967666 500
+ .quad 0x03FD973A3431356AE
+ .quad 0x03FD97F8C8E64A1C7 # 0.398409976059 501
+ .quad 0x03FD97F8C8E64A1C7
+ .quad 0x03FD9887CFB8A3932 # 0.398955579419 502
+ .quad 0x03FD9887CFB8A3932
+ .quad 0x03FD9946A2946EF3C # 0.399683513937 503
+ .quad 0x03FD9946A2946EF3C
+ .quad 0x03FD99D5D8130607C # 0.400229812776 504
+ .quad 0x03FD99D5D8130607C
+ .quad 0x03FD9A94E93E1EC37 # 0.400958675782 505
+ .quad 0x03FD9A94E93E1EC37
+ .quad 0x03FD9B244D87735E8 # 0.401505671875 506
+ .quad 0x03FD9B244D87735E8
+ .quad 0x03FD9BE39D2A97F0B # 0.402235465741 507
+ .quad 0x03FD9BE39D2A97F0B
+ .quad 0x03FD9CA3109266E23 # 0.402965792595 508
+ .quad 0x03FD9CA3109266E23
+ .quad 0x03FD9D32BEA15ED3A # 0.403513887977 509
+ .quad 0x03FD9D32BEA15ED3A
+ .quad 0x03FD9DF270C1914A8 # 0.404245149435 510
+ .quad 0x03FD9DF270C1914A8
+ .quad 0x03FD9E824DEA3E135 # 0.404793946669 511
+ .quad 0x03FD9E824DEA3E135
+ .quad 0x03FD9F423EEBF9DA1 # 0.405526145127 512
+ .quad 0x03FD9F423EEBF9DA1
+ .quad 0x03FD9FD24B4D47012 # 0.406075646011 513
+ .quad 0x03FD9FD24B4D47012
+ .quad 0x03FDA0927B59DA6E2 # 0.406808783874 514
+ .quad 0x03FDA0927B59DA6E2
+ .quad 0x03FDA152CF7F3B46D # 0.407542459622 515
+ .quad 0x03FDA152CF7F3B46D
+ .quad 0x03FDA1E32653B420E # 0.408093069896 516
+ .quad 0x03FDA1E32653B420E
+ .quad 0x03FDA2A3B9C527DB1 # 0.408827688845 517
+ .quad 0x03FDA2A3B9C527DB1
+ .quad 0x03FDA33440224FA79 # 0.409379007429 518
+ .quad 0x03FDA33440224FA79
+ .quad 0x03FDA3F513098DD09 # 0.410114572008 519
+ .quad 0x03FDA3F513098DD09
+ .quad 0x03FDA485C90EBDB0C # 0.410666600728 520
+ .quad 0x03FDA485C90EBDB0C
+ .quad 0x03FDA546DB95A721A # 0.411403113374 521
+ .quad 0x03FDA546DB95A721A
+ .quad 0x03FDA5D7C16257437 # 0.411955854060 522
+ .quad 0x03FDA5D7C16257437
+ .quad 0x03FDA69913B2F6572 # 0.412693317221 523
+ .quad 0x03FDA69913B2F6572
+ .quad 0x03FDA72A2966BE1EA # 0.413246771713 524
+ .quad 0x03FDA72A2966BE1EA
+ .quad 0x03FDA7EBBBAB46E8B # 0.413985187844 525
+ .quad 0x03FDA7EBBBAB46E8B
+ .quad 0x03FDA87D0165DD199 # 0.414539357989 526
+ .quad 0x03FDA87D0165DD199
+ .quad 0x03FDA93ED3C8AD9E3 # 0.415278729556 527
+ .quad 0x03FDA93ED3C8AD9E3
+ .quad 0x03FDA9D049A9E884A # 0.415833617206 528
+ .quad 0x03FDA9D049A9E884A
+ .quad 0x03FDAA925C5588EFA # 0.416573946686 529
+ .quad 0x03FDAA925C5588EFA
+ .quad 0x03FDAB24027D5E8AF # 0.417129553701 530
+ .quad 0x03FDAB24027D5E8AF
+ .quad 0x03FDABE6559C8167C # 0.417870843580 531
+ .quad 0x03FDABE6559C8167C
+ .quad 0x03FDAC782C2B07944 # 0.418427171828 532
+ .quad 0x03FDAC782C2B07944
+ .quad 0x03FDAD3ABFE88A06E # 0.419169424599 533
+ .quad 0x03FDAD3ABFE88A06E
+ .quad 0x03FDADCCC6FDF6A80 # 0.419726475955 534
+ .quad 0x03FDADCCC6FDF6A80
+ .quad 0x03FDAE5EE2E961227 # 0.420283837790 535
+ .quad 0x03FDAE5EE2E961227
+ .quad 0x03FDAF21D34189D0A # 0.421027470470 536
+ .quad 0x03FDAF21D34189D0A
+ .quad 0x03FDAFB41FE2167B4 # 0.421585558104 537
+ .quad 0x03FDAFB41FE2167B4
+ .quad 0x03FDB07751416A7F3 # 0.422330159776 538
+ .quad 0x03FDB07751416A7F3
+ .quad 0x03FDB109CEB79DB8A # 0.422888975102 539
+ .quad 0x03FDB109CEB79DB8A
+ .quad 0x03FDB1CD41498DF12 # 0.423634548296 540
+ .quad 0x03FDB1CD41498DF12
+ .quad 0x03FDB25FEFB60CB2E # 0.424194093214 541
+ .quad 0x03FDB25FEFB60CB2E
+ .quad 0x03FDB323A3A63594A # 0.424940640468 542
+ .quad 0x03FDB323A3A63594A
+ .quad 0x03FDB3B68329C59E9 # 0.425500916886 543
+ .quad 0x03FDB3B68329C59E9
+ .quad 0x03FDB44977C148F1A # 0.426061507389 544
+ .quad 0x03FDB44977C148F1A
+ .quad 0x03FDB50D895F7773A # 0.426809450580 545
+ .quad 0x03FDB50D895F7773A
+ .quad 0x03FDB5A0AF3D169CD # 0.427370775322 546
+ .quad 0x03FDB5A0AF3D169CD
+ .quad 0x03FDB66502A41E541 # 0.428119698779 547
+ .quad 0x03FDB66502A41E541
+ .quad 0x03FDB6F859E8EF639 # 0.428681759684 548
+ .quad 0x03FDB6F859E8EF639
+ .quad 0x03FDB78BC664238C0 # 0.429244136679 549
+ .quad 0x03FDB78BC664238C0
+ .quad 0x03FDB85078123E586 # 0.429994464983 550
+ .quad 0x03FDB85078123E586
+ .quad 0x03FDB8E41624226C5 # 0.430557580905 551
+ .quad 0x03FDB8E41624226C5
+ .quad 0x03FDB9A90A06BCB3D # 0.431308895742 552
+ .quad 0x03FDB9A90A06BCB3D
+ .quad 0x03FDBA3CD9D0B81BD # 0.431872752537 553
+ .quad 0x03FDBA3CD9D0B81BD
+ .quad 0x03FDBAD0BEF3DB164 # 0.432436927446 554
+ .quad 0x03FDBAD0BEF3DB164
+ .quad 0x03FDBB9611B80E2FC # 0.433189656123 555
+ .quad 0x03FDBB9611B80E2FC
+ .quad 0x03FDBC2A28C33B75D # 0.433754574696 556
+ .quad 0x03FDBC2A28C33B75D
+ .quad 0x03FDBCBE553C2BDDF # 0.434319812582 557
+ .quad 0x03FDBCBE553C2BDDF
+ .quad 0x03FDBD84073D8EC2B # 0.435073960430 558
+ .quad 0x03FDBD84073D8EC2B
+ .quad 0x03FDBE1865CEC1EC9 # 0.435639944787 559
+ .quad 0x03FDBE1865CEC1EC9
+ .quad 0x03FDBEACD9E271AD1 # 0.436206249662 560
+ .quad 0x03FDBEACD9E271AD1
+ .quad 0x03FDBF72EB7D20355 # 0.436961822044 561
+ .quad 0x03FDBF72EB7D20355
+ .quad 0x03FDC00791D99132B # 0.437528876213 562
+ .quad 0x03FDC00791D99132B
+ .quad 0x03FDC09C4DCD565AB # 0.438096252115 563
+ .quad 0x03FDC09C4DCD565AB
+ .quad 0x03FDC162BF5DF23E4 # 0.438853254422 564
+ .quad 0x03FDC162BF5DF23E4
+ .quad 0x03FDC1F7ADCB3DAB0 # 0.439421382456 565
+ .quad 0x03FDC1F7ADCB3DAB0
+ .quad 0x03FDC28CB1E4D32FD # 0.439989833442 566
+ .quad 0x03FDC28CB1E4D32FD
+ .quad 0x03FDC35383C8850B0 # 0.440748271097 567
+ .quad 0x03FDC35383C8850B0
+ .quad 0x03FDC3E8BA8CACF27 # 0.441317477070 568
+ .quad 0x03FDC3E8BA8CACF27
+ .quad 0x03FDC47E071233744 # 0.441887007223 569
+ .quad 0x03FDC47E071233744
+ .quad 0x03FDC54539A6ABCD2 # 0.442646885679 570
+ .quad 0x03FDC54539A6ABCD2
+ .quad 0x03FDC5DAB908186FF # 0.443217173690 571
+ .quad 0x03FDC5DAB908186FF
+ .quad 0x03FDC6704E4016FF7 # 0.443787787115 572
+ .quad 0x03FDC6704E4016FF7
+ .quad 0x03FDC737E1E38F4FB # 0.444549111857 573
+ .quad 0x03FDC737E1E38F4FB
+ .quad 0x03FDC7CDAA290FEAD # 0.445120486027 574
+ .quad 0x03FDC7CDAA290FEAD
+ .quad 0x03FDC863885A74D16 # 0.445692186852 575
+ .quad 0x03FDC863885A74D16
+ .quad 0x03FDC8F97C7E299DB # 0.446264214707 576
+ .quad 0x03FDC8F97C7E299DB
+ .quad 0x03FDC9C18EDC7C26B # 0.447027427871 577
+ .quad 0x03FDC9C18EDC7C26B
+ .quad 0x03FDCA57B64E9DB05 # 0.447600220249 578
+ .quad 0x03FDCA57B64E9DB05
+ .quad 0x03FDCAEDF3C88A364 # 0.448173340907 579
+ .quad 0x03FDCAEDF3C88A364
+ .quad 0x03FDCB844750B9995 # 0.448746790220 580
+ .quad 0x03FDCB844750B9995
+ .quad 0x03FDCC4CD90B3ECE5 # 0.449511901199 581
+ .quad 0x03FDCC4CD90B3ECE5
+ .quad 0x03FDCCE3602341C10 # 0.450086118843 582
+ .quad 0x03FDCCE3602341C10
+ .quad 0x03FDCD79FD5F2BC77 # 0.450660666403 583
+ .quad 0x03FDCD79FD5F2BC77
+ .quad 0x03FDCE10B0C581284 # 0.451235544257 584
+ .quad 0x03FDCE10B0C581284
+ .quad 0x03FDCED9C27EC6607 # 0.452002562511 585
+ .quad 0x03FDCED9C27EC6607
+ .quad 0x03FDCF70A9B6D3810 # 0.452578212532 586
+ .quad 0x03FDCF70A9B6D3810
+ .quad 0x03FDD007A72F19BBC # 0.453154194116 587
+ .quad 0x03FDD007A72F19BBC
+ .quad 0x03FDD09EBAEE29DD8 # 0.453730507647 588
+ .quad 0x03FDD09EBAEE29DD8
+ .quad 0x03FDD1684D49F46AE # 0.454499442710 589
+ .quad 0x03FDD1684D49F46AE
+ .quad 0x03FDD1FF951D1F1B3 # 0.455076532271 590
+ .quad 0x03FDD1FF951D1F1B3
+ .quad 0x03FDD296F34D0B65C # 0.455653955057 591
+ .quad 0x03FDD296F34D0B65C
+ .quad 0x03FDD32E67E056BD5 # 0.456231711452 592
+ .quad 0x03FDD32E67E056BD5
+ .quad 0x03FDD3C5F2DDA1840 # 0.456809801843 593
+ .quad 0x03FDD3C5F2DDA1840
+ .quad 0x03FDD490246DEFA6A # 0.457581109247 594
+ .quad 0x03FDD490246DEFA6A
+ .quad 0x03FDD527E3D1B95FC # 0.458159980465 595
+ .quad 0x03FDD527E3D1B95FC
+ .quad 0x03FDD5BFB9B5AE71F # 0.458739186968 596
+ .quad 0x03FDD5BFB9B5AE71F
+ .quad 0x03FDD657A6207C0DB # 0.459318729146 597
+ .quad 0x03FDD657A6207C0DB
+ .quad 0x03FDD6EFA918D25CE # 0.459898607388 598
+ .quad 0x03FDD6EFA918D25CE
+ .quad 0x03FDD7BA7AD9E7DA1 # 0.460672301817 599
+ .quad 0x03FDD7BA7AD9E7DA1
+ .quad 0x03FDD852B28BE5A0F # 0.461252965726 600
+ .quad 0x03FDD852B28BE5A0F
+ .quad 0x03FDD8EB00E1CCE14 # 0.461833967001 601
+ .quad 0x03FDD8EB00E1CCE14
+ .quad 0x03FDD98365E25ABB9 # 0.462415306035 602
+ .quad 0x03FDD98365E25ABB9
+ .quad 0x03FDDA1BE1944F538 # 0.462996983220 603
+ .quad 0x03FDDA1BE1944F538
+ .quad 0x03FDDAE75484C9615 # 0.463773079495 604
+ .quad 0x03FDDAE75484C9615
+ .quad 0x03FDDB8005445488B # 0.464355547233 605
+ .quad 0x03FDDB8005445488B
+ .quad 0x03FDDC18CCCBDCB83 # 0.464938354438 606
+ .quad 0x03FDDC18CCCBDCB83
+ .quad 0x03FDDCB1AB222F33D # 0.465521501504 607
+ .quad 0x03FDDCB1AB222F33D
+ .quad 0x03FDDD4AA04E1C4B7 # 0.466104988830 608
+ .quad 0x03FDDD4AA04E1C4B7
+ .quad 0x03FDDDE3AC56775D2 # 0.466688816812 609
+ .quad 0x03FDDDE3AC56775D2
+ .quad 0x03FDDE7CCF4216D6E # 0.467272985848 610
+ .quad 0x03FDDE7CCF4216D6E
+ .quad 0x03FDDF492177D7BBC # 0.468052409114 611
+ .quad 0x03FDDF492177D7BBC
+ .quad 0x03FDDFE279E5BF4EE # 0.468637375496 612
+ .quad 0x03FDDFE279E5BF4EE
+ .quad 0x03FDE07BE94DCC439 # 0.469222684263 613
+ .quad 0x03FDE07BE94DCC439
+ .quad 0x03FDE1156FB6E2626 # 0.469808335817 614
+ .quad 0x03FDE1156FB6E2626
+ .quad 0x03FDE1AF0D27E88D7 # 0.470394330560 615
+ .quad 0x03FDE1AF0D27E88D7
+ .quad 0x03FDE248C1A7C8C26 # 0.470980668894 616
+ .quad 0x03FDE248C1A7C8C26
+ .quad 0x03FDE2E28D3D701CC # 0.471567351222 617
+ .quad 0x03FDE2E28D3D701CC
+ .quad 0x03FDE37C6FEFCED73 # 0.472154377948 618
+ .quad 0x03FDE37C6FEFCED73
+ .quad 0x03FDE449C232C39D8 # 0.472937616681 619
+ .quad 0x03FDE449C232C39D8
+ .quad 0x03FDE4E3DAEDDB5F6 # 0.473525448578 620
+ .quad 0x03FDE4E3DAEDDB5F6
+ .quad 0x03FDE57E0ADCE1EA5 # 0.474113626224 621
+ .quad 0x03FDE57E0ADCE1EA5
+ .quad 0x03FDE6185206D516F # 0.474702150027 622
+ .quad 0x03FDE6185206D516F
+ .quad 0x03FDE6B2B072B5E6F # 0.475291020395 623
+ .quad 0x03FDE6B2B072B5E6F
+ .quad 0x03FDE74D26278887A # 0.475880237735 624
+ .quad 0x03FDE74D26278887A
+ .quad 0x03FDE7E7B32C5453F # 0.476469802457 625
+ .quad 0x03FDE7E7B32C5453F
+ .quad 0x03FDE882578823D52 # 0.477059714970 626
+ .quad 0x03FDE882578823D52
+ .quad 0x03FDE91D134204C67 # 0.477649975686 627
+ .quad 0x03FDE91D134204C67
+ .quad 0x03FDE9B7E6610815A # 0.478240585015 628
+ .quad 0x03FDE9B7E6610815A
+ .quad 0x03FDEA52D0EC41E5E # 0.478831543369 629
+ .quad 0x03FDEA52D0EC41E5E
+ .quad 0x03FDEB218376ECFC0 # 0.479620031484 630
+ .quad 0x03FDEB218376ECFC0
+ .quad 0x03FDEBBCA4C4E9E87 # 0.480211805838 631
+ .quad 0x03FDEBBCA4C4E9E87
+ .quad 0x03FDEC57DD96CD0CB # 0.480803930597 632
+ .quad 0x03FDEC57DD96CD0CB
+ .quad 0x03FDECF32DF3B887D # 0.481396406174 633
+ .quad 0x03FDECF32DF3B887D
+ .quad 0x03FDED8E95E2D1B88 # 0.481989232987 634
+ .quad 0x03FDED8E95E2D1B88
+ .quad 0x03FDEE2A156B413E5 # 0.482582411453 635
+ .quad 0x03FDEE2A156B413E5
+ .quad 0x03FDEEC5AC9432FCB # 0.483175941987 636
+ .quad 0x03FDEEC5AC9432FCB
+ .quad 0x03FDEF615B64D61C7 # 0.483769825010 637
+ .quad 0x03FDEF615B64D61C7
+ .quad 0x03FDEFFD21E45D0D1 # 0.484364060939 638
+ .quad 0x03FDEFFD21E45D0D1
+ .quad 0x03FDF0990019FD887 # 0.484958650194 639
+ .quad 0x03FDF0990019FD887
+ .quad 0x03FDF134F60CF092D # 0.485553593197 640
+ .quad 0x03FDF134F60CF092D
+ .quad 0x03FDF1D103C4727E4 # 0.486148890367 641
+ .quad 0x03FDF1D103C4727E4
+ .quad 0x03FDF26D2947C2EC5 # 0.486744542127 642
+ .quad 0x03FDF26D2947C2EC5
+ .quad 0x03FDF309669E24CF9 # 0.487340548899 643
+ .quad 0x03FDF309669E24CF9
+ .quad 0x03FDF3A5BBCEDE6E1 # 0.487936911107 644
+ .quad 0x03FDF3A5BBCEDE6E1
+ .quad 0x03FDF44228E13963A # 0.488533629176 645
+ .quad 0x03FDF44228E13963A
+ .quad 0x03FDF4DEADDC82A35 # 0.489130703529 646
+ .quad 0x03FDF4DEADDC82A35
+ .quad 0x03FDF57B4AC80A79A # 0.489728134594 647
+ .quad 0x03FDF57B4AC80A79A
+ .quad 0x03FDF617FFAB248ED # 0.490325922795 648
+ .quad 0x03FDF617FFAB248ED
+ .quad 0x03FDF6B4CC8D27E87 # 0.490924068561 649
+ .quad 0x03FDF6B4CC8D27E87
+ .quad 0x03FDF751B1756EEC8 # 0.491522572320 650
+ .quad 0x03FDF751B1756EEC8
+ .quad 0x03FDF7EEAE6B5761C # 0.492121434499 651
+ .quad 0x03FDF7EEAE6B5761C
+ .quad 0x03FDF88BC3764273B # 0.492720655530 652
+ .quad 0x03FDF88BC3764273B
+ .quad 0x03FDF928F09D94B32 # 0.493320235842 653
+ .quad 0x03FDF928F09D94B32
+ .quad 0x03FDF9C635E8B6192 # 0.493920175866 654
+ .quad 0x03FDF9C635E8B6192
+ .quad 0x03FDFA63935F1208C # 0.494520476034 655
+ .quad 0x03FDFA63935F1208C
+ .quad 0x03FDFB0109081751A # 0.495121136779 656
+ .quad 0x03FDFB0109081751A
+ .quad 0x03FDFB9E96EB38311 # 0.495722158534 657
+ .quad 0x03FDFB9E96EB38311
+ .quad 0x03FDFC3C3D0FEA555 # 0.496323541733 658
+ .quad 0x03FDFC3C3D0FEA555
+ .quad 0x03FDFCD9FB7DA6DEF # 0.496925286812 659
+ .quad 0x03FDFCD9FB7DA6DEF
+ .quad 0x03FDFD77D23BEA634 # 0.497527394206 660
+ .quad 0x03FDFD77D23BEA634
+ .quad 0x03FDFE15C15234EE2 # 0.498129864352 661
+ .quad 0x03FDFE15C15234EE2
+ .quad 0x03FDFEB3C8C80A04E # 0.498732697687 662
+ .quad 0x03FDFEB3C8C80A04E
+ .quad 0x03FDFF51E8A4F0A74 # 0.499335894649 663
+ .quad 0x03FDFF51E8A4F0A74
+ .quad 0x03FDFFF020F07352E # 0.499939455677 664
+ .quad 0x03FDFFF020F07352E
+ .quad 0x03FE004738D910023 # 0.500543381211 665
+ .quad 0x03FE004738D910023
+ .quad 0x03FE00966D78C41CF # 0.501147671692 666
+ .quad 0x03FE00966D78C41CF
+ .quad 0x03FE00E5AE5B207AB # 0.501752327560 667
+ .quad 0x03FE00E5AE5B207AB
+ .quad 0x03FE011A8B18F0ED6 # 0.502155634684 668
+ .quad 0x03FE011A8B18F0ED6
+ .quad 0x03FE0169E072D7311 # 0.502760900515 669
+ .quad 0x03FE0169E072D7311
+ .quad 0x03FE01B942198A5A1 # 0.503366532915 670
+ .quad 0x03FE01B942198A5A1
+ .quad 0x03FE0208B010DB642 # 0.503972532327 671
+ .quad 0x03FE0208B010DB642
+ .quad 0x03FE02582A5C9D122 # 0.504578899198 672
+ .quad 0x03FE02582A5C9D122
+ .quad 0x03FE02A7B100A3EF0 # 0.505185633972 673
+ .quad 0x03FE02A7B100A3EF0
+ .quad 0x03FE02F74400C64EA # 0.505792737097 674
+ .quad 0x03FE02F74400C64EA
+ .quad 0x03FE0346E360DC4F9 # 0.506400209020 675
+ .quad 0x03FE0346E360DC4F9
+ .quad 0x03FE03968F24BFDB6 # 0.507008050190 676
+ .quad 0x03FE03968F24BFDB6
+ .quad 0x03FE03E647504CA89 # 0.507616261055 677
+ .quad 0x03FE03E647504CA89
+ .quad 0x03FE04360BE7603AE # 0.508224842066 678
+ .quad 0x03FE04360BE7603AE
+ .quad 0x03FE046B4089BE0FD # 0.508630768599 679
+ .quad 0x03FE046B4089BE0FD
+ .quad 0x03FE04BB19DCA36B3 # 0.509239967521 680
+ .quad 0x03FE04BB19DCA36B3
+ .quad 0x03FE050AFFA5671A5 # 0.509849537793 681
+ .quad 0x03FE050AFFA5671A5
+ .quad 0x03FE055AF1E7ED47B # 0.510459479867 682
+ .quad 0x03FE055AF1E7ED47B
+ .quad 0x03FE05AAF0A81BF04 # 0.511069794198 683
+ .quad 0x03FE05AAF0A81BF04
+ .quad 0x03FE05FAFBE9DAE58 # 0.511680481240 684
+ .quad 0x03FE05FAFBE9DAE58
+ .quad 0x03FE064B13B113CDD # 0.512291541448 685
+ .quad 0x03FE064B13B113CDD
+ .quad 0x03FE069B3801B2263 # 0.512902975280 686
+ .quad 0x03FE069B3801B2263
+ .quad 0x03FE06D0AC85B63A2 # 0.513310805628 687
+ .quad 0x03FE06D0AC85B63A2
+ .quad 0x03FE0720E5C40DF1D # 0.513922863181 688
+ .quad 0x03FE0720E5C40DF1D
+ .quad 0x03FE07712B9648153 # 0.514535295577 689
+ .quad 0x03FE07712B9648153
+ .quad 0x03FE07C17E0056E7C # 0.515148103277 690
+ .quad 0x03FE07C17E0056E7C
+ .quad 0x03FE0811DD062E889 # 0.515761286740 691
+ .quad 0x03FE0811DD062E889
+ .quad 0x03FE086248ABC4F3B # 0.516374846428 692
+ .quad 0x03FE086248ABC4F3B
+ .quad 0x03FE08B2C0F512033 # 0.516988782802 693
+ .quad 0x03FE08B2C0F512033
+ .quad 0x03FE08E86D82DA3EE # 0.517398283218 694
+ .quad 0x03FE08E86D82DA3EE
+ .quad 0x03FE0938FAE5D8E9B # 0.518012848432 695
+ .quad 0x03FE0938FAE5D8E9B
+ .quad 0x03FE098994F72C539 # 0.518627791569 696
+ .quad 0x03FE098994F72C539
+ .quad 0x03FE09DA3BBAD339C # 0.519243113094 697
+ .quad 0x03FE09DA3BBAD339C
+ .quad 0x03FE0A2AEF34CE3D1 # 0.519858813473 698
+ .quad 0x03FE0A2AEF34CE3D1
+ .quad 0x03FE0A7BAF691FE34 # 0.520474893172 699
+ .quad 0x03FE0A7BAF691FE34
+ .quad 0x03FE0AB18BF5823C3 # 0.520885823936 700
+ .quad 0x03FE0AB18BF5823C3
+ .quad 0x03FE0B02616952989 # 0.521502536876 701
+ .quad 0x03FE0B02616952989
+ .quad 0x03FE0B5343A234476 # 0.522119630385 702
+ .quad 0x03FE0B5343A234476
+ .quad 0x03FE0BA432A430CA2 # 0.522737104934 703
+ .quad 0x03FE0BA432A430CA2
+ .quad 0x03FE0BF52E73538CE # 0.523354960993 704
+ .quad 0x03FE0BF52E73538CE
+ .quad 0x03FE0C463713A9E6F # 0.523973199034 705
+ .quad 0x03FE0C463713A9E6F
+ .quad 0x03FE0C7C43F4C861E # 0.524385570174 706
+ .quad 0x03FE0C7C43F4C861E
+ .quad 0x03FE0CCD61FAD07D2 # 0.525004445903 707
+ .quad 0x03FE0CCD61FAD07D2
+ .quad 0x03FE0D1E8CDCE3DB6 # 0.525623704876 708
+ .quad 0x03FE0D1E8CDCE3DB6
+ .quad 0x03FE0D6FC49F16E93 # 0.526243347569 709
+ .quad 0x03FE0D6FC49F16E93
+ .quad 0x03FE0DC109458004A # 0.526863374456 710
+ .quad 0x03FE0DC109458004A
+ .quad 0x03FE0DF73E353F0ED # 0.527276939392 711
+ .quad 0x03FE0DF73E353F0ED
+ .quad 0x03FE0E4898611CCE1 # 0.527897607665 712
+ .quad 0x03FE0E4898611CCE1
+ .quad 0x03FE0E99FF7C20738 # 0.528518661406 713
+ .quad 0x03FE0E99FF7C20738
+ .quad 0x03FE0EEB738A67874 # 0.529140101094 714
+ .quad 0x03FE0EEB738A67874
+ .quad 0x03FE0F21C81D1ADC3 # 0.529554608872 715
+ .quad 0x03FE0F21C81D1ADC3
+ .quad 0x03FE0F7351C9FCD7F # 0.530176692874 716
+ .quad 0x03FE0F7351C9FCD7F
+ .quad 0x03FE0FC4E875254C1 # 0.530799164104 717
+ .quad 0x03FE0FC4E875254C1
+ .quad 0x03FE10168C22B8FB9 # 0.531422023047 718
+ .quad 0x03FE10168C22B8FB9
+ .quad 0x03FE10683CD6DEA54 # 0.532045270185 719
+ .quad 0x03FE10683CD6DEA54
+ .quad 0x03FE109EB9E2E4C97 # 0.532460984179 720
+ .quad 0x03FE109EB9E2E4C97
+ .quad 0x03FE10F08055E7785 # 0.533084879385 721
+ .quad 0x03FE10F08055E7785
+ .quad 0x03FE114253DA97DA0 # 0.533709164079 722
+ .quad 0x03FE114253DA97DA0
+ .quad 0x03FE1194347523FDC # 0.534333838748 723
+ .quad 0x03FE1194347523FDC
+ .quad 0x03FE11CAD1789B0F8 # 0.534750505421 724
+ .quad 0x03FE11CAD1789B0F8
+ .quad 0x03FE121CC7EB8F7E6 # 0.535375831132 725
+ .quad 0x03FE121CC7EB8F7E6
+ .quad 0x03FE126ECB7F8F007 # 0.536001548120 726
+ .quad 0x03FE126ECB7F8F007
+ .quad 0x03FE12A57FDA37091 # 0.536418910396 727
+ .quad 0x03FE12A57FDA37091
+ .quad 0x03FE12F799594EFBC # 0.537045280601 728
+ .quad 0x03FE12F799594EFBC
+ .quad 0x03FE1349C004AFB00 # 0.537672043392 729
+ .quad 0x03FE1349C004AFB00
+ .quad 0x03FE139BF3E094003 # 0.538299199261 730
+ .quad 0x03FE139BF3E094003
+ .quad 0x03FE13D2C873C5E13 # 0.538717521794 731
+ .quad 0x03FE13D2C873C5E13
+ .quad 0x03FE142512549C16C # 0.539345333889 732
+ .quad 0x03FE142512549C16C
+ .quad 0x03FE14776971477F1 # 0.539973540381 733
+ .quad 0x03FE14776971477F1
+ .quad 0x03FE14C9CDCE0A74D # 0.540602141763 734
+ .quad 0x03FE14C9CDCE0A74D
+ .quad 0x03FE1500C2BFD1561 # 0.541021428981 735
+ .quad 0x03FE1500C2BFD1561
+ .quad 0x03FE15533D3B8D7B3 # 0.541650689621 736
+ .quad 0x03FE15533D3B8D7B3
+ .quad 0x03FE15A5C502C6DC5 # 0.542280346478 737
+ .quad 0x03FE15A5C502C6DC5
+ .quad 0x03FE15DCD1973457B # 0.542700338085 738
+ .quad 0x03FE15DCD1973457B
+ .quad 0x03FE162F6F9071F76 # 0.543330656416 739
+ .quad 0x03FE162F6F9071F76
+ .quad 0x03FE16821AE0A13C6 # 0.543961372300 740
+ .quad 0x03FE16821AE0A13C6
+ .quad 0x03FE16B93F2C12808 # 0.544382070665 741
+ .quad 0x03FE16B93F2C12808
+ .quad 0x03FE170C00C169B51 # 0.545013450251 742
+ .quad 0x03FE170C00C169B51
+ .quad 0x03FE175ECFB935CC6 # 0.545645228728 743
+ .quad 0x03FE175ECFB935CC6
+ .quad 0x03FE17B1AC17CBD5B # 0.546277406602 744
+ .quad 0x03FE17B1AC17CBD5B
+ .quad 0x03FE17E8F12052E8A # 0.546699080654 745
+ .quad 0x03FE17E8F12052E8A
+ .quad 0x03FE183BE3DE8A7AF # 0.547331925312 746
+ .quad 0x03FE183BE3DE8A7AF
+ .quad 0x03FE188EE40F23CA7 # 0.547965170715 747
+ .quad 0x03FE188EE40F23CA7
+ .quad 0x03FE18C640FF75F06 # 0.548387557205 748
+ .quad 0x03FE18C640FF75F06
+ .quad 0x03FE191957A30FA51 # 0.549021471648 749
+ .quad 0x03FE191957A30FA51
+ .quad 0x03FE196C7BC4B1F3A # 0.549655788193 750
+ .quad 0x03FE196C7BC4B1F3A
+ .quad 0x03FE19A3F0B1860BD # 0.550078889532 751
+ .quad 0x03FE19A3F0B1860BD
+ .quad 0x03FE19F72B59A0CEC # 0.550713877383 752
+ .quad 0x03FE19F72B59A0CEC
+ .quad 0x03FE1A4A738B7A33C # 0.551349268700 753
+ .quad 0x03FE1A4A738B7A33C
+ .quad 0x03FE1A820089A2156 # 0.551773087312 754
+ .quad 0x03FE1A820089A2156
+ .quad 0x03FE1AD55F55855C8 # 0.552409152212 755
+ .quad 0x03FE1AD55F55855C8
+ .quad 0x03FE1B28CBB6EC93E # 0.553045621948 756
+ .quad 0x03FE1B28CBB6EC93E
+ .quad 0x03FE1B6070DB553D8 # 0.553470160269 757
+ .quad 0x03FE1B6070DB553D8
+ .quad 0x03FE1BB3F3EA714F6 # 0.554107305878 758
+ .quad 0x03FE1BB3F3EA714F6
+ .quad 0x03FE1BEBA8316EF2C # 0.554532295260 759
+ .quad 0x03FE1BEBA8316EF2C
+ .quad 0x03FE1C3F41FA97C6B # 0.555170118179 760
+ .quad 0x03FE1C3F41FA97C6B
+ .quad 0x03FE1C92E96C86020 # 0.555808348176 761
+ .quad 0x03FE1C92E96C86020
+ .quad 0x03FE1CCAB5FBFFEE1 # 0.556234061252 762
+ .quad 0x03FE1CCAB5FBFFEE1
+ .quad 0x03FE1D1E743BCFC47 # 0.556872970868 763
+ .quad 0x03FE1D1E743BCFC47
+ .quad 0x03FE1D72403052E75 # 0.557512288951 764
+ .quad 0x03FE1D72403052E75
+ .quad 0x03FE1DAA251D7E433 # 0.557938728190 765
+ .quad 0x03FE1DAA251D7E433
+ .quad 0x03FE1DFE07F3D1DAB # 0.558578728212 766
+ .quad 0x03FE1DFE07F3D1DAB
+ .quad 0x03FE1E35FC265D75E # 0.559005622562 767
+ .quad 0x03FE1E35FC265D75E
+ .quad 0x03FE1E89F5EB04126 # 0.559646305979 768
+ .quad 0x03FE1E89F5EB04126
+ .quad 0x03FE1EDDFD77E1FEF # 0.560287400135 769
+ .quad 0x03FE1EDDFD77E1FEF
+ .quad 0x03FE1F160A2AD0DA3 # 0.560715024687 770
+ .quad 0x03FE1F160A2AD0DA3
+ .quad 0x03FE1F6A28BA1B476 # 0.561356804579 771
+ .quad 0x03FE1F6A28BA1B476
+ .quad 0x03FE1FBE551DB43C1 # 0.561998996616 772
+ .quad 0x03FE1FBE551DB43C1
+ .quad 0x03FE1FF67A6684F47 # 0.562427353873 773
+ .quad 0x03FE1FF67A6684F47
+ .quad 0x03FE204ABDE0BE5DF # 0.563070233998 774
+ .quad 0x03FE204ABDE0BE5DF
+ .quad 0x03FE2082F29233211 # 0.563499050471 775
+ .quad 0x03FE2082F29233211
+ .quad 0x03FE20D74D2FBAFE4 # 0.564142620160 776
+ .quad 0x03FE20D74D2FBAFE4
+ .quad 0x03FE210F91524B469 # 0.564571896835 777
+ .quad 0x03FE210F91524B469
+ .quad 0x03FE2164031FDA0B0 # 0.565216157568 778
+ .quad 0x03FE2164031FDA0B0
+ .quad 0x03FE21B882DD26040 # 0.565860833641 779
+ .quad 0x03FE21B882DD26040
+ .quad 0x03FE21F0DFC65CEEC # 0.566290848698 780
+ .quad 0x03FE21F0DFC65CEEC
+ .quad 0x03FE224576C81FFE0 # 0.566936218194 781
+ .quad 0x03FE224576C81FFE0
+ .quad 0x03FE227DE33896A44 # 0.567366696031 782
+ .quad 0x03FE227DE33896A44
+ .quad 0x03FE22D2918BA4A31 # 0.568012760445 783
+ .quad 0x03FE22D2918BA4A31
+ .quad 0x03FE23274DE272A83 # 0.568659242528 784
+ .quad 0x03FE23274DE272A83
+ .quad 0x03FE235FD33D232FC # 0.569090462888 785
+ .quad 0x03FE235FD33D232FC
+ .quad 0x03FE23B4A6F9D8688 # 0.569737642287 786
+ .quad 0x03FE23B4A6F9D8688
+ .quad 0x03FE23ED3BF21CA33 # 0.570169328026 787
+ .quad 0x03FE23ED3BF21CA33
+ .quad 0x03FE24422721A89D7 # 0.570817206248 788
+ .quad 0x03FE24422721A89D7
+ .quad 0x03FE247ACBC023D2B # 0.571249358372 789
+ .quad 0x03FE247ACBC023D2B
+ .quad 0x03FE24CFCE6F80D9B # 0.571897936927 790
+ .quad 0x03FE24CFCE6F80D9B
+ .quad 0x03FE250882BCDD7D8 # 0.572330556445 791
+ .quad 0x03FE250882BCDD7D8
+ .quad 0x03FE255D9CF910A56 # 0.572979836849 792
+ .quad 0x03FE255D9CF910A56
+ .quad 0x03FE25B2C55CD5762 # 0.573629539091 793
+ .quad 0x03FE25B2C55CD5762
+ .quad 0x03FE25EB92D41992D # 0.574062908546 794
+ .quad 0x03FE25EB92D41992D
+ .quad 0x03FE2640D2D99FFEA # 0.574713315073 795
+ .quad 0x03FE2640D2D99FFEA
+ .quad 0x03FE2679B0166F51C # 0.575147154559 796
+ .quad 0x03FE2679B0166F51C
+ .quad 0x03FE26CF07CAD8B00 # 0.575798266899 797
+ .quad 0x03FE26CF07CAD8B00
+ .quad 0x03FE2707F4D5F7C40 # 0.576232577438 798
+ .quad 0x03FE2707F4D5F7C40
+ .quad 0x03FE275D644670606 # 0.576884397124 799
+ .quad 0x03FE275D644670606
+ .quad 0x03FE27966128AB11B # 0.577319179739 800
+ .quad 0x03FE27966128AB11B
+ .quad 0x03FE27EBE8626A387 # 0.577971708311 801
+ .quad 0x03FE27EBE8626A387
+ .quad 0x03FE2824F52493BD2 # 0.578406964030 802
+ .quad 0x03FE2824F52493BD2
+ .quad 0x03FE287A9434DBC7B # 0.579060203030 803
+ .quad 0x03FE287A9434DBC7B
+ .quad 0x03FE28B3B0DFCEB80 # 0.579495932884 804
+ .quad 0x03FE28B3B0DFCEB80
+ .quad 0x03FE290967D3ED18D # 0.580149883861 805
+ .quad 0x03FE290967D3ED18D
+ .quad 0x03FE294294708B773 # 0.580586088885 806
+ .quad 0x03FE294294708B773
+ .quad 0x03FE29986355D8C69 # 0.581240753393 807
+ .quad 0x03FE29986355D8C69
+ .quad 0x03FE29D19FED0C082 # 0.581677434622 808
+ .quad 0x03FE29D19FED0C082
+ .quad 0x03FE2A2786D0EC107 # 0.582332814220 809
+ .quad 0x03FE2A2786D0EC107
+ .quad 0x03FE2A60D36BA5253 # 0.582769972697 810
+ .quad 0x03FE2A60D36BA5253
+ .quad 0x03FE2AB6D25B86EF7 # 0.583426068948 811
+ .quad 0x03FE2AB6D25B86EF7
+ .quad 0x03FE2AF02F02BE4AB # 0.583863705716 812
+ .quad 0x03FE2AF02F02BE4AB
+ .quad 0x03FE2B46460C1C2B3 # 0.584520520190 813
+ .quad 0x03FE2B46460C1C2B3
+ .quad 0x03FE2B7FB2C8D1CC1 # 0.584958636297 814
+ .quad 0x03FE2B7FB2C8D1CC1
+ .quad 0x03FE2BD5E1F9316F2 # 0.585616170568 815
+ .quad 0x03FE2BD5E1F9316F2
+ .quad 0x03FE2C0F5ED46CE8D # 0.586054767066 816
+ .quad 0x03FE2C0F5ED46CE8D
+ .quad 0x03FE2C65A6395F5F5 # 0.586713022712 817
+ .quad 0x03FE2C65A6395F5F5
+ .quad 0x03FE2C9F333C2FE1E # 0.587152100656 818
+ .quad 0x03FE2C9F333C2FE1E
+ .quad 0x03FE2CF592E351AE5 # 0.587811079263 819
+ .quad 0x03FE2CF592E351AE5
+ .quad 0x03FE2D2F3016CE0EF # 0.588250639709 820
+ .quad 0x03FE2D2F3016CE0EF
+ .quad 0x03FE2D85A80DC7324 # 0.588910342867 821
+ .quad 0x03FE2D85A80DC7324
+ .quad 0x03FE2DBF557B0DF43 # 0.589350386878 822
+ .quad 0x03FE2DBF557B0DF43
+ .quad 0x03FE2E15E5CF91FA7 # 0.590010816181 823
+ .quad 0x03FE2E15E5CF91FA7
+ .quad 0x03FE2E4FA37FC9577 # 0.590451344823 824
+ .quad 0x03FE2E4FA37FC9577
+ .quad 0x03FE2E8967B3BF4E1 # 0.590892067615 825
+ .quad 0x03FE2E8967B3BF4E1
+ .quad 0x03FE2EE01A3BED567 # 0.591553516212 826
+ .quad 0x03FE2EE01A3BED567
+ .quad 0x03FE2F19EEBFB00BA # 0.591994725131 827
+ .quad 0x03FE2F19EEBFB00BA
+ .quad 0x03FE2F70B9C67A7C2 # 0.592656903723 828
+ .quad 0x03FE2F70B9C67A7C2
+ .quad 0x03FE2FAA9EA342D04 # 0.593098599843 829
+ .quad 0x03FE2FAA9EA342D04
+ .quad 0x03FE3001823684D73 # 0.593761510043 830
+ .quad 0x03FE3001823684D73
+ .quad 0x03FE303B7775937EF # 0.594203694441 831
+ .quad 0x03FE303B7775937EF
+ .quad 0x03FE309273A3340FC # 0.594867337868 832
+ .quad 0x03FE309273A3340FC
+ .quad 0x03FE30CC794DD19D0 # 0.595310011625 833
+ .quad 0x03FE30CC794DD19D0
+ .quad 0x03FE3106858C76BB7 # 0.595752881428 834
+ .quad 0x03FE3106858C76BB7
+ .quad 0x03FE315DA4434068B # 0.596417554101 835
+ .quad 0x03FE315DA4434068B
+ .quad 0x03FE3197C0FA80E6A # 0.596860914783 836
+ .quad 0x03FE3197C0FA80E6A
+ .quad 0x03FE31EEF86D36EF1 # 0.597526324589 837
+ .quad 0x03FE31EEF86D36EF1
+ .quad 0x03FE322925A66E62D # 0.597970177237 838
+ .quad 0x03FE322925A66E62D
+ .quad 0x03FE328075E32022F # 0.598636325813 839
+ .quad 0x03FE328075E32022F
+ .quad 0x03FE32BAB3A7B21E9 # 0.599080671521 840
+ .quad 0x03FE32BAB3A7B21E9
+ .quad 0x03FE32F4F80D0B1BD # 0.599525214760 841
+ .quad 0x03FE32F4F80D0B1BD
+ .quad 0x03FE334C6B15D30DD # 0.600192400374 842
+ .quad 0x03FE334C6B15D30DD
+ .quad 0x03FE3386C013B90D6 # 0.600637438209 843
+ .quad 0x03FE3386C013B90D6
+ .quad 0x03FE33DE4C086C40A # 0.601305366543 844
+ .quad 0x03FE33DE4C086C40A
+ .quad 0x03FE3418B1A85622C # 0.601750900077 845
+ .quad 0x03FE3418B1A85622C
+ .quad 0x03FE34531DF21CFE3 # 0.602196632199 846
+ .quad 0x03FE34531DF21CFE3
+ .quad 0x03FE34AACCE299BA5 # 0.602865603124 847
+ .quad 0x03FE34AACCE299BA5
+ .quad 0x03FE34E549DBB21EF # 0.603311832493 848
+ .quad 0x03FE34E549DBB21EF
+ .quad 0x03FE353D11DA4F855 # 0.603981550121 849
+ .quad 0x03FE353D11DA4F855
+ .quad 0x03FE35779F8C43D6D # 0.604428277847 850
+ .quad 0x03FE35779F8C43D6D
+ .quad 0x03FE35B233F13DD4A # 0.604875205229 851
+ .quad 0x03FE35B233F13DD4A
+ .quad 0x03FE360A1F1BBA738 # 0.605545971045 852
+ .quad 0x03FE360A1F1BBA738
+ .quad 0x03FE3644C446F97BC # 0.605993398346 853
+ .quad 0x03FE3644C446F97BC
+ .quad 0x03FE367F702A9EA94 # 0.606441025927 854
+ .quad 0x03FE367F702A9EA94
+ .quad 0x03FE36D77E9D34FD7 # 0.607112843218 855
+ .quad 0x03FE36D77E9D34FD7
+ .quad 0x03FE37123B54987B7 # 0.607560972287 856
+ .quad 0x03FE37123B54987B7
+ .quad 0x03FE376A630C0A1D6 # 0.608233542652 857
+ .quad 0x03FE376A630C0A1D6
+ .quad 0x03FE37A530A0D5A31 # 0.608682174333 858
+ .quad 0x03FE37A530A0D5A31
+ .quad 0x03FE37E004F74E13B # 0.609131007374 859
+ .quad 0x03FE37E004F74E13B
+ .quad 0x03FE383850278CFD9 # 0.609804634884 860
+ .quad 0x03FE383850278CFD9
+ .quad 0x03FE3873356902AB7 # 0.610253972119 861
+ .quad 0x03FE3873356902AB7
+ .quad 0x03FE38AE2171976E8 # 0.610703511349 862
+ .quad 0x03FE38AE2171976E8
+ .quad 0x03FE390690373AFFF # 0.611378199331 863
+ .quad 0x03FE390690373AFFF
+ .quad 0x03FE39418D3872A53 # 0.611828244343 864
+ .quad 0x03FE39418D3872A53
+ .quad 0x03FE397C91064221F # 0.612278491987 865
+ .quad 0x03FE397C91064221F
+ .quad 0x03FE39D5237E045A5 # 0.612954243787 866
+ .quad 0x03FE39D5237E045A5
+ .quad 0x03FE3A1038522CE82 # 0.613404998809 867
+ .quad 0x03FE3A1038522CE82
+ .quad 0x03FE3A68E45AD354B # 0.614081512534 868
+ .quad 0x03FE3A68E45AD354B
+ .quad 0x03FE3AA40A3F2A68B # 0.614532776080 869
+ .quad 0x03FE3AA40A3F2A68B
+ .quad 0x03FE3ADF36F98A182 # 0.614984243356 870
+ .quad 0x03FE3ADF36F98A182
+ .quad 0x03FE3B3806E5DF340 # 0.615661826668 871
+ .quad 0x03FE3B3806E5DF340
+ .quad 0x03FE3B7344BE40311 # 0.616113804077 872
+ .quad 0x03FE3B7344BE40311
+ .quad 0x03FE3BAE897234A87 # 0.616565985862 873
+ .quad 0x03FE3BAE897234A87
+ .quad 0x03FE3C077D5F51881 # 0.617244642149 874
+ .quad 0x03FE3C077D5F51881
+ .quad 0x03FE3C42D33F2AE7B # 0.617697335683 875
+ .quad 0x03FE3C42D33F2AE7B
+ .quad 0x03FE3C7E30002960C # 0.618150234241 876
+ .quad 0x03FE3C7E30002960C
+ .quad 0x03FE3CD7480B4A8A3 # 0.618829966906 877
+ .quad 0x03FE3CD7480B4A8A3
+ .quad 0x03FE3D12B60622748 # 0.619283378838 878
+ .quad 0x03FE3D12B60622748
+ .quad 0x03FE3D4E2AE7B7E2B # 0.619736996447 879
+ .quad 0x03FE3D4E2AE7B7E2B
+ .quad 0x03FE3D89A6B1A558D # 0.620190819917 880
+ .quad 0x03FE3D89A6B1A558D
+ .quad 0x03FE3DE2ED57B1F9B # 0.620871941524 881
+ .quad 0x03FE3DE2ED57B1F9B
+ .quad 0x03FE3E1E7A6D8330E # 0.621326280468 882
+ .quad 0x03FE3E1E7A6D8330E
+ .quad 0x03FE3E5A0E714DA6E # 0.621780825931 883
+ .quad 0x03FE3E5A0E714DA6E
+ .quad 0x03FE3EB37978B85B6 # 0.622463031756 884
+ .quad 0x03FE3EB37978B85B6
+ .quad 0x03FE3EEF1ED68236B # 0.622918094335 885
+ .quad 0x03FE3EEF1ED68236B
+ .quad 0x03FE3F2ACB27ED6C7 # 0.623373364090 886
+ .quad 0x03FE3F2ACB27ED6C7
+ .quad 0x03FE3F845AAE68C81 # 0.624056657591 887
+ .quad 0x03FE3F845AAE68C81
+ .quad 0x03FE3FC0186800514 # 0.624512446113 888
+ .quad 0x03FE3FC0186800514
+ .quad 0x03FE3FFBDD1AE8406 # 0.624968442473 889
+ .quad 0x03FE3FFBDD1AE8406
+ .quad 0x03FE4037A8C8C197A # 0.625424646860 890
+ .quad 0x03FE4037A8C8C197A
+ .quad 0x03FE409167679DD99 # 0.626109343909 891
+ .quad 0x03FE409167679DD99
+ .quad 0x03FE40CD448FF6DD6 # 0.626566069196 892
+ .quad 0x03FE40CD448FF6DD6
+ .quad 0x03FE410928B8F950F # 0.627023003177 893
+ .quad 0x03FE410928B8F950F
+ .quad 0x03FE41630C1B50AFF # 0.627708795866 894
+ .quad 0x03FE41630C1B50AFF
+ .quad 0x03FE419F01CD27AD0 # 0.628166252416 895
+ .quad 0x03FE419F01CD27AD0
+ .quad 0x03FE41DAFE85672B9 # 0.628623918328 896
+ .quad 0x03FE41DAFE85672B9
+ .quad 0x03FE42170245B4C6A # 0.629081793794 897
+ .quad 0x03FE42170245B4C6A
+ .quad 0x03FE42711518DF546 # 0.629769000326 898
+ .quad 0x03FE42711518DF546
+ .quad 0x03FE42AD2A74888A0 # 0.630227400518 899
+ .quad 0x03FE42AD2A74888A0
+ .quad 0x03FE42E946DE080C0 # 0.630686010936 900
+ .quad 0x03FE42E946DE080C0
+ .quad 0x03FE43437EB9D9424 # 0.631374321162 901
+ .quad 0x03FE43437EB9D9424
+ .quad 0x03FE437FACCD31C10 # 0.631833457993 902
+ .quad 0x03FE437FACCD31C10
+ .quad 0x03FE43BBE1F42FE09 # 0.632292805727 903
+ .quad 0x03FE43BBE1F42FE09
+ .quad 0x03FE43F81E307DE5E # 0.632752364559 904
+ .quad 0x03FE43F81E307DE5E
+ .quad 0x03FE445285D68EA69 # 0.633442099038 905
+ .quad 0x03FE445285D68EA69
+ .quad 0x03FE448ED3CF71355 # 0.633902186463 906
+ .quad 0x03FE448ED3CF71355
+ .quad 0x03FE44CB28E37C3EE # 0.634362485666 907
+ .quad 0x03FE44CB28E37C3EE
+ .quad 0x03FE450785145CAFE # 0.634822996841 908
+ .quad 0x03FE450785145CAFE
+ .quad 0x03FE45621CB769366 # 0.635514161481 909
+ .quad 0x03FE45621CB769366
+ .quad 0x03FE459E8AB7B799D # 0.635975203444 910
+ .quad 0x03FE459E8AB7B799D
+ .quad 0x03FE45DAFFDABD4DB # 0.636436458065 911
+ .quad 0x03FE45DAFFDABD4DB
+ .quad 0x03FE46177C2229EC0 # 0.636897925539 912
+ .quad 0x03FE46177C2229EC0
+ .quad 0x03FE467243F53F69E # 0.637590526283 913
+ .quad 0x03FE467243F53F69E
+ .quad 0x03FE46AED21F117FC # 0.638052526753 914
+ .quad 0x03FE46AED21F117FC
+ .quad 0x03FE46EB677335D13 # 0.638514740766 915
+ .quad 0x03FE46EB677335D13
+ .quad 0x03FE472803F35EAAE # 0.638977168520 916
+ .quad 0x03FE472803F35EAAE
+ .quad 0x03FE4764A7A13EF3B # 0.639439810212 917
+ .quad 0x03FE4764A7A13EF3B
+ .quad 0x03FE47BFAA9F80271 # 0.640134174319 918
+ .quad 0x03FE47BFAA9F80271
+ .quad 0x03FE47FC60471DAF8 # 0.640597351724 919
+ .quad 0x03FE47FC60471DAF8
+ .quad 0x03FE48391D226992D # 0.641060743762 920
+ .quad 0x03FE48391D226992D
+ .quad 0x03FE4875E1331971E # 0.641524350631 921
+ .quad 0x03FE4875E1331971E
+ .quad 0x03FE48D114D3FB884 # 0.642220164181 922
+ .quad 0x03FE48D114D3FB884
+ .quad 0x03FE490DEAF1A3FC8 # 0.642684309003 923
+ .quad 0x03FE490DEAF1A3FC8
+ .quad 0x03FE494AC84AB0ED3 # 0.643148669355 924
+ .quad 0x03FE494AC84AB0ED3
+ .quad 0x03FE4987ACE0DABB0 # 0.643613245438 925
+ .quad 0x03FE4987ACE0DABB0
+ .quad 0x03FE49C498B5DA63F # 0.644078037452 926
+ .quad 0x03FE49C498B5DA63F
+ .quad 0x03FE4A20080EF10B2 # 0.644775630783 927
+ .quad 0x03FE4A20080EF10B2
+ .quad 0x03FE4A5D060894B8C # 0.645240963504 928
+ .quad 0x03FE4A5D060894B8C
+ .quad 0x03FE4A9A0B471A943 # 0.645706512861 929
+ .quad 0x03FE4A9A0B471A943
+ .quad 0x03FE4AD717CC3E626 # 0.646172279055 930
+ .quad 0x03FE4AD717CC3E626
+ .quad 0x03FE4B142B99BC871 # 0.646638262288 931
+ .quad 0x03FE4B142B99BC871
+ .quad 0x03FE4B6FD6F970C1F # 0.647337644529 932
+ .quad 0x03FE4B6FD6F970C1F
+ .quad 0x03FE4BACFD036D080 # 0.647804171246 933
+ .quad 0x03FE4BACFD036D080
+ .quad 0x03FE4BEA2A5BDBE87 # 0.648270915712 934
+ .quad 0x03FE4BEA2A5BDBE87
+ .quad 0x03FE4C275F047C956 # 0.648737878130 935
+ .quad 0x03FE4C275F047C956
+ .quad 0x03FE4C649AFF0EE16 # 0.649205058703 936
+ .quad 0x03FE4C649AFF0EE16
+ .quad 0x03FE4CC082B46485A # 0.649906239052 937
+ .quad 0x03FE4CC082B46485A
+ .quad 0x03FE4CFDD1037E37C # 0.650373965908 938
+ .quad 0x03FE4CFDD1037E37C
+ .quad 0x03FE4D3B26AAADDD9 # 0.650841911635 939
+ .quad 0x03FE4D3B26AAADDD9
+ .quad 0x03FE4D7883ABB61F6 # 0.651310076438 940
+ .quad 0x03FE4D7883ABB61F6
+ .quad 0x03FE4DB5E8085A477 # 0.651778460521 941
+ .quad 0x03FE4DB5E8085A477
+ .quad 0x03FE4DF353C25E42B # 0.652247064091 942
+ .quad 0x03FE4DF353C25E42B
+ .quad 0x03FE4E4F832C560DD # 0.652950381434 943
+ .quad 0x03FE4E4F832C560DD
+ .quad 0x03FE4E8D015786F16 # 0.653419534621 944
+ .quad 0x03FE4E8D015786F16
+ .quad 0x03FE4ECA86E64A683 # 0.653888908016 945
+ .quad 0x03FE4ECA86E64A683
+ .quad 0x03FE4F0813DA673DD # 0.654358501826 946
+ .quad 0x03FE4F0813DA673DD
+ .quad 0x03FE4F45A835A4E19 # 0.654828316258 947
+ .quad 0x03FE4F45A835A4E19
+ .quad 0x03FE4F8343F9CB678 # 0.655298351519 948
+ .quad 0x03FE4F8343F9CB678
+ .quad 0x03FE4FDFBB88A119A # 0.656003818920 949
+ .quad 0x03FE4FDFBB88A119A
+ .quad 0x03FE501D69DADD660 # 0.656474407164 950
+ .quad 0x03FE501D69DADD660
+ .quad 0x03FE505B1F9C43ED7 # 0.656945216966 951
+ .quad 0x03FE505B1F9C43ED7
+ .quad 0x03FE5098DCCE9FABA # 0.657416248534 952
+ .quad 0x03FE5098DCCE9FABA
+ .quad 0x03FE50D6A173BC425 # 0.657887502077 953
+ .quad 0x03FE50D6A173BC425
+ .quad 0x03FE51146D8D65F98 # 0.658358977805 954
+ .quad 0x03FE51146D8D65F98
+ .quad 0x03FE5152411D69C03 # 0.658830675927 955
+ .quad 0x03FE5152411D69C03
+ .quad 0x03FE51AF0C774A2D0 # 0.659538640558 956
+ .quad 0x03FE51AF0C774A2D0
+ .quad 0x03FE51ECF2B713F8A # 0.660010895584 957
+ .quad 0x03FE51ECF2B713F8A
+ .quad 0x03FE522AE0738A3D8 # 0.660483373741 958
+ .quad 0x03FE522AE0738A3D8
+ .quad 0x03FE5268D5AE7CDCB # 0.660956075239 959
+ .quad 0x03FE5268D5AE7CDCB
+ .quad 0x03FE52A6D269BC600 # 0.661429000289 960
+ .quad 0x03FE52A6D269BC600
+ .quad 0x03FE52E4D6A719F9B # 0.661902149103 961
+ .quad 0x03FE52E4D6A719F9B
+ .quad 0x03FE5322E26867857 # 0.662375521893 962
+ .quad 0x03FE5322E26867857
+ .quad 0x03FE53800225BA6E2 # 0.663086001497 963
+ .quad 0x03FE53800225BA6E2
+ .quad 0x03FE53BE20B8DA502 # 0.663559935155 964
+ .quad 0x03FE53BE20B8DA502
+ .quad 0x03FE53FC46D64DDD1 # 0.664034093533 965
+ .quad 0x03FE53FC46D64DDD1
+ .quad 0x03FE543A747FE9ED6 # 0.664508476843 966
+ .quad 0x03FE543A747FE9ED6
+ .quad 0x03FE5478A9B78404C # 0.664983085300 967
+ .quad 0x03FE5478A9B78404C
+ .quad 0x03FE54B6E67EF251C # 0.665457919117 968
+ .quad 0x03FE54B6E67EF251C
+ .quad 0x03FE54F52AD80BAE9 # 0.665932978509 969
+ .quad 0x03FE54F52AD80BAE9
+ .quad 0x03FE553376C4A7A16 # 0.666408263689 970
+ .quad 0x03FE553376C4A7A16
+ .quad 0x03FE5571CA469E5C9 # 0.666883774872 971
+ .quad 0x03FE5571CA469E5C9
+ .quad 0x03FE55CF55C5A5437 # 0.667597465874 972
+ .quad 0x03FE55CF55C5A5437
+ .quad 0x03FE560DBC45153C7 # 0.668073543008 973
+ .quad 0x03FE560DBC45153C7
+ .quad 0x03FE564C2A6059FE7 # 0.668549846899 974
+ .quad 0x03FE564C2A6059FE7
+ .quad 0x03FE568AA0194EC6E # 0.669026377763 975
+ .quad 0x03FE568AA0194EC6E
+ .quad 0x03FE56C91D71CF810 # 0.669503135817 976
+ .quad 0x03FE56C91D71CF810
+ .quad 0x03FE5707A26BB8C66 # 0.669980121278 977
+ .quad 0x03FE5707A26BB8C66
+ .quad 0x03FE57462F08E7DF5 # 0.670457334363 978
+ .quad 0x03FE57462F08E7DF5
+ .quad 0x03FE5784C34B3AC30 # 0.670934775289 979
+ .quad 0x03FE5784C34B3AC30
+ .quad 0x03FE57C35F3490183 # 0.671412444273 980
+ .quad 0x03FE57C35F3490183
+ .quad 0x03FE580202C6C7353 # 0.671890341535 981
+ .quad 0x03FE580202C6C7353
+ .quad 0x03FE5840AE03C0204 # 0.672368467291 982
+ .quad 0x03FE5840AE03C0204
+ .quad 0x03FE589EBD437CA31 # 0.673086084831 983
+ .quad 0x03FE589EBD437CA31
+ .quad 0x03FE58DD7BB392B30 # 0.673564782782 984
+ .quad 0x03FE58DD7BB392B30
+ .quad 0x03FE591C41D500163 # 0.674043709994 985
+ .quad 0x03FE591C41D500163
+ .quad 0x03FE595B0FA9A7EF1 # 0.674522866688 986
+ .quad 0x03FE595B0FA9A7EF1
+ .quad 0x03FE5999E5336E121 # 0.675002253082 987
+ .quad 0x03FE5999E5336E121
+ .quad 0x03FE59D8C2743705E # 0.675481869398 988
+ .quad 0x03FE59D8C2743705E
+ .quad 0x03FE5A17A76DE803B # 0.675961715857 989
+ .quad 0x03FE5A17A76DE803B
+ .quad 0x03FE5A56942266F7B # 0.676441792678 990
+ .quad 0x03FE5A56942266F7B
+ .quad 0x03FE5A9588939A810 # 0.676922100084 991
+ .quad 0x03FE5A9588939A810
+ .quad 0x03FE5AD484C369F2D # 0.677402638296 992
+ .quad 0x03FE5AD484C369F2D
+ .quad 0x03FE5B1388B3BD53E # 0.677883407536 993
+ .quad 0x03FE5B1388B3BD53E
+ .quad 0x03FE5B5294667D5F7 # 0.678364408027 994
+ .quad 0x03FE5B5294667D5F7
+ .quad 0x03FE5B91A7DD93852 # 0.678845639990 995
+ .quad 0x03FE5B91A7DD93852
+ .quad 0x03FE5BD0C31AE9E9D # 0.679327103649 996
+ .quad 0x03FE5BD0C31AE9E9D
+ .quad 0x03FE5C2F7A8ED5E5B # 0.680049734055 997
+ .quad 0x03FE5C2F7A8ED5E5B
+ .quad 0x03FE5C6EA94431EF9 # 0.680531777930 998
+ .quad 0x03FE5C6EA94431EF9
+ .quad 0x03FE5CADDFC6874F5 # 0.681014054284 999
+ .quad 0x03FE5CADDFC6874F5
+ .quad 0x03FE5CED1E17C35C6 # 0.681496563340 1000
+ .quad 0x03FE5CED1E17C35C6
+ .quad 0x03FE5D2C6439D4252 # 0.681979305324 1001
+ .quad 0x03FE5D2C6439D4252
+ .quad 0x03FE5D6BB22EA86F6 # 0.682462280460 1002
+ .quad 0x03FE5D6BB22EA86F6
+ .quad 0x03FE5DAB07F82FB84 # 0.682945488974 1003
+ .quad 0x03FE5DAB07F82FB84
+ .quad 0x03FE5DEA65985A350 # 0.683428931091 1004
+ .quad 0x03FE5DEA65985A350
+ .quad 0x03FE5E29CB1118D32 # 0.683912607038 1005
+ .quad 0x03FE5E29CB1118D32
+ .quad 0x03FE5E6938645D390 # 0.684396517040 1006
+ .quad 0x03FE5E6938645D390
+ .quad 0x03FE5EA8AD9419C5B # 0.684880661324 1007
+ .quad 0x03FE5EA8AD9419C5B
+ .quad 0x03FE5EE82AA241920 # 0.685365040118 1008
+ .quad 0x03FE5EE82AA241920
+ .quad 0x03FE5F27AF90C8705 # 0.685849653648 1009
+ .quad 0x03FE5F27AF90C8705
+ .quad 0x03FE5F673C61A2ED2 # 0.686334502142 1010
+ .quad 0x03FE5F673C61A2ED2
+ .quad 0x03FE5FA6D116C64F7 # 0.686819585829 1011
+ .quad 0x03FE5FA6D116C64F7
+ .quad 0x03FE5FE66DB228992 # 0.687304904936 1012
+ .quad 0x03FE5FE66DB228992
+ .quad 0x03FE60261235C0874 # 0.687790459692 1013
+ .quad 0x03FE60261235C0874
+ .quad 0x03FE6065BEA385926 # 0.688276250325 1014
+ .quad 0x03FE6065BEA385926
+ .quad 0x03FE60A572FD6FEF1 # 0.688762277066 1015
+ .quad 0x03FE60A572FD6FEF1
+ .quad 0x03FE60E52F45788E4 # 0.689248540144 1016
+ .quad 0x03FE60E52F45788E4
+ .quad 0x03FE6124F37D991D4 # 0.689735039789 1017
+ .quad 0x03FE6124F37D991D4
+ .quad 0x03FE6164BFA7CC06C # 0.690221776231 1018
+ .quad 0x03FE6164BFA7CC06C
+ .quad 0x03FE61A493C60C729 # 0.690708749700 1019
+ .quad 0x03FE61A493C60C729
+ .quad 0x03FE61E46FDA56466 # 0.691195960429 1020
+ .quad 0x03FE61E46FDA56466
+ .quad 0x03FE622453E6A6263 # 0.691683408647 1021
+ .quad 0x03FE622453E6A6263
+ .quad 0x03FE62643FECF9743 # 0.692171094587 1022
+ .quad 0x03FE62643FECF9743
+ .quad 0x03FE62A433EF4E51A # 0.692659018480 1023
+ .quad 0x03FE62A433EF4E51A
diff --git a/src/gas/vrdacos.S b/src/gas/vrdacos.S
new file mode 100644
index 0000000..5e2b3a4
--- /dev/null
+++ b/src/gas/vrdacos.S
@@ -0,0 +1,3118 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdacos.S
+#
+# An array implementation of the cos libm function.
+#
+# Prototype:
+#
+# void vrda_cos(int n, double *x, double *y);
+#
+#Computes Cosine of x for an array of input values.
+#Places the results into the supplied y array.
+#Does not perform error checking.
+#Denormal inputs may produce unexpected results
+#Author: Harsha Jagasia
+#Email: harsha.jagasia@amd.com
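+#
+# Illustrative usage sketch (added commentary, not part of the upstream
+# sources): calling vrda_cos from C with the prototype shown above. The
+# array contents and length below are arbitrary placeholders.
+#
+# /*
+# #include <stdio.h>
+#
+# extern void vrda_cos(int n, double *x, double *y);
+#
+# int main(void)
+# {
+#     double x[4] = {0.0, 0.5, 1.0, 2.0};  /* input values          */
+#     double y[4];                         /* results written here  */
+#     vrda_cos(4, x, y);                   /* y[i] = cos(x[i])      */
+#     for (int i = 0; i < 4; i++)
+#         printf("cos(%g) = %.17g\n", x[i], y[i]);
+#     return 0;
+# }
+# */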
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
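+
+# Added note (not in the upstream sources): the coefficients above correspond
+# to the series used over the reduced argument r (|r| <= pi/4):
+#   cos(r) ~= 1 - r^2/2 + c1*r^4 + c2*r^6 + c3*r^8 + c4*r^10 + c5*r^12 + c6*r^14
+#   sin(r) ~=     r     + s1*r^3 + s2*r^5 + s3*r^7 + s4*r^9  + s5*r^11 + s6*r^13
+# e.g. c1 ~ 1/4! = 0.0416667, c2 ~ -1/6! = -0.00138889, s1 ~ -1/3! = -0.166667.
+# The leading 1 - r^2/2 and r terms appear to be applied separately in the
+# evaluation code (note the 0.5 constant defined above).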
+
+.align 16
+.Levencos_oddsin_tbl:
+ .quad .Lcoscos_coscos_piby4 # 0 *
+ .quad .Lcoscos_cossin_piby4 # 1 +
+ .quad .Lcoscos_sincos_piby4 # 2
+ .quad .Lcoscos_sinsin_piby4 # 3 +
+
+ .quad .Lcossin_coscos_piby4 # 4
+ .quad .Lcossin_cossin_piby4 # 5 *
+ .quad .Lcossin_sincos_piby4 # 6
+ .quad .Lcossin_sinsin_piby4 # 7
+
+ .quad .Lsincos_coscos_piby4 # 8
+ .quad .Lsincos_cossin_piby4 # 9
+ .quad .Lsincos_sincos_piby4 # 10 *
+ .quad .Lsincos_sinsin_piby4 # 11
+
+ .quad .Lsinsin_coscos_piby4 # 12
+ .quad .Lsinsin_cossin_piby4 # 13 +
+ .quad .Lsinsin_sincos_piby4 # 14
+ .quad .Lsinsin_sinsin_piby4 # 15 *
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrda_cos_
+ .set vrda_cos_,__vrda_cos__
+ .weak vrda_cos__
+ .set vrda_cos__,__vrda_cos__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array cos
+#** VRDA_COS(N,X,Y)
+# C equivalent*/
+#void vrda_cos__(int * n, double *x, double *y)
+#{
+# vrda_cos(*n,x,y);
+#}
+.globl __vrda_cos__
+ .type __vrda_cos__,@function
+__vrda_cos__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+
+.equ p_xmm6, 0x20 # save area for xmm6
+.equ p_xmm7, 0x30 # save area for xmm7
+.equ p_xmm8, 0x40 # save area for xmm8
+.equ p_xmm9, 0x50 # save area for xmm9
+.equ p_xmm10, 0x60 # save area for xmm10
+.equ p_xmm11, 0x70 # save area for xmm11
+.equ p_xmm12, 0x80 # save area for xmm12
+.equ p_xmm13, 0x90 # save area for xmm13
+.equ p_xmm14, 0x0A0 # save area for xmm14
+.equ p_xmm15, 0x0B0 # save area for xmm15
+
+.equ r, 0x0C0 # storage for r, passed by address to __amd_remainder_piby2
+.equ rr, 0x0D0 # storage for rr for __amd_remainder_piby2
+.equ region, 0x0E0 # storage for region for __amd_remainder_piby2
+
+.equ r1, 0x0F0 # storage for r of the second pair
+.equ rr1, 0x0100 # storage for rr of the second pair
+.equ region1, 0x0110 # storage for region of the second pair
+
+.equ p_temp2, 0x0120 # temporary for get/put bits operation
+.equ p_temp3, 0x0130 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0140 # temporary for get/put bits operation
+.equ p_temp5, 0x0150 # temporary for get/put bits operation
+
+.equ p_original, 0x0160 # original x
+.equ p_mask, 0x0170 # mask
+.equ p_sign, 0x0180 # sign
+
+.equ p_original1, 0x0190 # original x (second pair)
+.equ p_mask1, 0x01A0 # mask (second pair)
+.equ p_sign1, 0x01B0 # sign (second pair)
+
+.equ save_xa, 0x01C0 #qword
+.equ save_ya, 0x01D0 #qword
+
+.equ save_nv, 0x01E0 #qword
+.equ p_iter, 0x01F0 #qword storage for number of loop iterations
+
+
+.globl vrda_cos
+ .type vrda_cos,@function
+vrda_cos:
+# parameters are passed in by Linux C as:
+# edi - int n
+# rsi - double *x
+# rdx - double *y
+
+
+ sub $0x208,%rsp
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+	jz	.L__vrda_cleanup		# jump if fewer than four values; handle them singly in cleanup
+
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrda_top:
+# build the input __m128d
+ movapd .L__real_7fffffffffffffff(%rip),%xmm2
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movdqa %xmm0,p_original(%rsp)
+ movlpd -16(%rsi), %xmm1
+ movhpd -8(%rsi), %xmm1
+ movdqa %xmm1,p_original1(%rsp)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm0,%rax #rax is lower arg
+movhpd %xmm0, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+movd %xmm1,%r8				#r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp)		#
+mov p_temp1+8(%rsp),%r9			#r9 = upper arg
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#DEBUG
+# add $0x1C8,%rsp
+# ret
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
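+
+# Added note: 5e5 (0x411E848000000000) appears to be the cutoff beyond which
+# the inline two-constant reduction below would lose too much precision; any
+# lane whose |x| >= 5e5 is instead reduced on the slower scalar paths via the
+# __amd_remainder_piby2 helper.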
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%rax # Region
+ movd %xmm5,%rcx # Region
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
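+
+# Added sketch (C-style pseudocode, not part of the upstream sources) of the
+# extra-precision reduction implemented by the surrounding instructions; the
+# names mirror the .L__real_* constants, and r/rr are formed a few lines below:
+#
+# /*
+# npi2  = (double)(int)(x * twobypi + 0.5);          // nearest multiple of pi/2
+# rhead = x - npi2 * piby2_1;                        // leading remainder
+# rtail = npi2 * piby2_2;
+# t     = rhead;
+# rhead = t - rtail;
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+# r     = rhead - rtail;                             // reduced argument
+# rr    = (rhead - r) - rtail;                       // low-order correction
+# */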
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+# paddd .L__reald_one_one(%rip),%xmm4 ; Sign
+# paddd .L__reald_one_one(%rip),%xmm5 ; Sign
+# pand .L__reald_two_two(%rip),%xmm4
+# pand .L__reald_two_two(%rip),%xmm5
+# punpckldq %xmm4,%xmm4
+# punpckldq %xmm5,%xmm5
+# psllq $62,%xmm4
+# psllq $62,%xmm5
+
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+ shl $62,%r8
+ and .L__reald_two_zero(%rip),%r10
+ shl $30,%r10
+ shl $62,%r9
+ and .L__reald_two_zero(%rip),%r11
+ shl $30,%r11
+
+ mov %r8,p_sign(%rsp)
+ mov %r10,p_sign+8(%rsp)
+ mov %r9,p_sign1(%rsp)
+ mov %r11,p_sign1+8(%rsp)
+
+# GET_BITS_DP64(rhead-rtail, uy); originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ and .L__reald_one_one(%rip),%rax # Region
+ and .L__reald_one_one(%rip),%rcx # Region
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
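+
+# Added note: the 4-bit index built above packs one bit per input lane
+# (bits 0-1 from the first pair, bits 2-3 from the second pair); each bit is
+# the low bit of that lane's region modulo 4, i.e. whether the lane reduces to
+# a sin or a cos polynomial, selecting one of the 16 .Levencos_oddsin_tbl
+# entries.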
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rax,%rcx,%r8,%r9
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm0,%xmm2,%xmm6 = x, %xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm5,%xmm8,%xmm10,%xmm12
+# %xmm9,%xmm11,%xmm13
+
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portions of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10				# xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
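+
+# The scalar sequence above is the usual Cody-Waite reduction with pi/2 split
+# into three parts (piby2_1, piby2_2, piby2_2tail, loaded above); in C it is
+# roughly the following sketch:
+#
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;              /* piby2_1 carries only the leading bits of pi/2 */
+#   rtail = npi2 * piby2_2;
+#   t     = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   r     = rhead - rtail;                   /* reduced argument  */
+#   rr    = (rhead - r) - rtail;             /* correction term   */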
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
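+
+# __amd_remainder_piby2 is the slow out-of-line reduction used once an element
+# is too large for the three-part split; as called here it takes the argument
+# in %xmm0 and pointers to the r, rr and region slots in %rdi, %rsi and %rdx,
+# so registers still holding live values are parked in the p_temp* slots
+# across the call.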
+
+.L__vrd4_cos_lower_naninf:
+	mov	p_original(%rsp),%rax			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
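+
+# NaN/Inf inputs skip the reduction entirely: OR-ing in 0x0008000000000000
+# (the quiet bit of the mantissa) turns the value into a quiet NaN, which then
+# propagates through the polynomial kernel as the final answer, while rr and
+# the region are zeroed so the rest of the path stays well defined.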
+
+.align 16
+0:
+
+
+#DEBUG
+#	movapd	r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+# movd %xmm6,%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm10, xmm12
+# xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+#	movlhps	%xmm0,%xmm0	;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm10	# xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	%eax,region(%rsp)			# store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+	movlpd	%xmm0,r(%rsp)				# store lower r
+	movlpd	%xmm6,rr(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
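+
+# The first pair is now reduced; the same 5e5 threshold is applied to the
+# third and fourth arguments to choose between the packed fast reduction
+# below and the scalar / out-of-line handling at .Lfirst_second_done_*.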
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_cos_reconstruct
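+
+# This packed block is the same Cody-Waite reduction as the scalar sketch
+# earlier, applied to both elements of the second pair at once
+# (cvttpd2dq/cvtdq2pd produce the two npi2 values in one step); r1, rr1 and
+# region1 are written straight to their stack slots for the reconstruct phase.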
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13
+# xmm5, xmm8, xmm10, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+#DEBUG
+# movapd %xmm0,%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd %xmm1,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_higher:
+	mov	p_original1(%rsp),%r8			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+
+
+#DEBUG
+# movapd rr(%rsp),%xmm4
+# movapd rr1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ jmp .L__vrd4_cos_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movd %r8,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+#	mov	%r9,r1+8(%rsp)
+# movapd r1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+
+ jmp 0f
+
+.L__vrd4_cos_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_of_both_gt_5e5_higher:
+ mov p_original1+8(%rsp),%r9 #upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+#DEBUG
+# movapd r(%rsp),%xmm4
+# movapd r1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+
+ jmp .L__vrd4_cos_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# rax, rcx, r8, r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+#	movlhps	%xmm1,%xmm1	;Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+	subsd	%xmm0,%xmm1					# xmm1 = r = (rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+	subsd	%xmm0,%xmm7					# xmm7 = rr = ((rhead-r) - rtail)
+
+	movlpd	%xmm1,r1(%rsp)				# store lower r
+	movlpd	%xmm7,rr1(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_cos_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_cos_upper_naninf_higher:
+ mov p_original1+8(%rsp),%r9 # upper arg is nan/inf
+# mov r1+8(%rsp),%r9 # upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_cos_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_cos_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#DEBUG
+# movapd region(%rsp),%xmm4
+# movapd region1(%rsp),%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%rax
+ mov region1(%rsp),%rcx
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+ shl $62,%r8
+ and .L__reald_two_zero(%rip),%r10
+ shl $30,%r10
+ shl $62,%r9
+ and .L__reald_two_zero(%rip),%r11
+ shl $30,%r11
+
+ mov %r8,p_sign(%rsp)
+ mov %r10,p_sign+8(%rsp)
+ mov %r9,p_sign1(%rsp)
+ mov %r11,p_sign1+8(%rsp)
+
+ and .L__reald_one_one(%rip),%rax # Region
+ and .L__reald_one_one(%rip),%rcx # Region
+
+ mov %rax,%r8
+ mov %rcx,%r9
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+#DEBUG
+# movd %rax,%xmm4
+# movd %rax,%xmm5
+# xorpd %xmm0,%xmm0
+# xorpd %xmm1,%xmm1
+# jmp .L__vrd4_cos_cleanup
+#DEBUG
+
+ leaq .Levencos_oddsin_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_cos_cleanup:
+
+ movapd p_sign(%rsp), %xmm0
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm0 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+.L__vrda_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+.L__vrda_bottom2:
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm1, -16(%rdi)
+ movhpd %xmm1, -8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrda_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrda_cleanup
+
+.L__final_check:
+ add $0x208,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when we have an odd number of cos calls to make at the end
+# we assume that rdx is pointing at the next x array element, r8 at the next y array element.
+# The number of values left is in save_nv
+
+.align 16
+.L__vrda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_temp+8(%rsp)
+ movapd %xmm0,p_temp+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_temp+8(%rsp)
+ cmp $3,%rax
+ jl .L__vrdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_temp+16(%rsp)
+
+.L__vrdacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &y parameter
+ call vrda_cos@PLT
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p_temp2(%rsp),%rcx
+ mov %rcx, (%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+8(%rsp),%rcx
+ mov %rcx, 8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+16(%rsp),%rcx
+ mov %rcx, 16(%rdi) # do the third value
+
+.L__vrdacgf:
+ jmp .L__final_check
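+
+# Tail handling: the 1-3 leftover inputs are copied into a zero-padded scratch
+# buffer, vrda_cos is re-entered on that full group of four, and only the
+# valid results are copied back out, so the main loop never sees a partial
+# vector.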
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
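+# Both elements of both pairs need a cos here.  Each lane is evaluated as
+#
+#   cos(x) ~= t + ( x^4*zc + ((1 - t) - r - x*xx) ),   r = 0.5*x^2,  t = 1 - r,
+#
+# where zc is the .Lcosarray polynomial in x^2 accumulated below; carrying t
+# and the x*xx term separately keeps the cancellation against 1 - 0.5*x^2 and
+# the rr correction from being lost.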
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12		# t recalculate, -t = r-1
+	subpd	.L__real_3ff0000000000000(%rip),%xmm13		# t recalculate, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
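+# The low lane of each pair needs sin, the high lane needs cos.  The sin lane
+# is evaluated as  sin(x) ~= x + (x^3*zs - 0.5*x^2*xx + xx)  with zs the
+# packed .Lsincosarray polynomial; the cos lane reuses the t/r scheme of the
+# coscos kernel, and movlhps merges the two scalar results back into one
+# register at the end.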
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+ addsd p_temp(%rsp),%xmm4 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm0,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term
+
+ movapd .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos)
+
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos)
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin)
+ mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos)
+
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep low r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin)
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos)
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin)
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+
+ addsd p_temp(%rsp),%xmm4 # sin+xx +
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx +
+
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm0,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+ movapd %xmm1,p_temp3(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+	movhlps	%xmm2,%xmm6					# move high x2 for x3 for sin term
+	movhlps	%xmm3,%xmm7					# move high x2 for x3 for sin term
+	mulsd	p_temp2+8(%rsp),%xmm6				# get high x3 for sin term
+	mulsd	p_temp3+8(%rsp),%xmm7				# get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term
+ # Reverse 12 and 2
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm7,%xmm9 # sin *x3
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm11,%xmm9 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_sincos_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term +
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+	movhlps	%xmm2,%xmm6					# move high x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term +
+	mulsd	p_temp2+8(%rsp),%xmm6				# get high x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term +
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
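+
+# Mixed case: the first pair (xmm0/xmm2) is evaluated with the packed sin
+# polynomial (.Lsinarray) and the second pair (xmm1/xmm3) with the packed cos
+# polynomial (.Lcosarray); the two evaluations are interleaved instruction by
+# instruction.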
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # store x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm11,p_temp3(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm0,%xmm2 # x3 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm6,%xmm12 # 0.5 * x2 *xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm12,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm6,%xmm4 # x3 * zs +xx
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm13		# t recalculate, -t = r-1
+ addpd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm3,p_temp3(%rsp) # store x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm10,p_temp2(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm1,%xmm3 # x3 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm7,%xmm13 # 0.5 * x2 *xx
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx;;;;;;;;;;;;;;;;;;;;;
+ subpd %xmm13,%xmm5 # -0.5 * x2 *xx
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm7,%xmm5 # +xx
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12		# t recalculate, -t = r-1
+ addpd %xmm1,%xmm5 # +x
+ subpd %xmm12,%xmm4 # + t
+
+ jmp .L__vrd4_cos_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4:			#Derived from cossin_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+ movhlps %xmm10,%xmm10 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ movsd %xmm0,%xmm8 # lower x for sin
+ mulsd %xmm2,%xmm8 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm8,%xmm2 # lower x3 for sin
+
+ movsd %xmm6,%xmm9 # lower xx
+ # note using odd reg
+
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx for upper cos term
+ mulpd %xmm1,%xmm7 # x * xx
+ movhlps %xmm6,%xmm6
+ mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+
+ subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm8 # + t
+ addsd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4:			#Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zszc
+ addpd %xmm9,%xmm5 # z
+
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+ mulpd %xmm3,%xmm3 # x4
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+ mulpd %xmm1,%xmm7 # x * xx
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+ mulpd %xmm3,%xmm5
+ # x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+ addsd %xmm0,%xmm8 # +x
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+ movhlps %xmm11,%xmm11 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zcs
+
+ movsd %xmm1,%xmm9 # lower x for sin
+ mulsd %xmm3,%xmm9 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm9,%xmm3 # lower x3 for sin
+
+ movsd %xmm7,%xmm8 # lower xx
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for upper cos term
+ movhlps %xmm7,%xmm7
+ mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12		# recalculate t, -t = r-1
+	subsd	.L__real_3ff0000000000000(%rip),%xmm13		# recalculate t, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm9 # + t
+ addsd %xmm1,%xmm5 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ movhlps %xmm11,%xmm11
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm3,%xmm12
+ mulsd %xmm1,%xmm12 # low x3 for sin
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm3,%xmm3 # high x4 for cos
+ movsd %xmm12,%xmm3 # low x3 for sin
+
+ movhlps %xmm1,%xmm8 # upper x for cos term
+ # note using even reg
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term
+
+ mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+	mulpd	%xmm3,%xmm5					# lower=x3 * zs
+								# upper=x4 * zc
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx
+
+ subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+
+ addsd %xmm1,%xmm5 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm9 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm5 # + t
+ addsd %xmm1,%xmm9 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ addsd %xmm1,%xmm9 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm5 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+	movapd	 %xmm10,p_temp2(%rsp)				# r
+ movapd %xmm6,p_temp(%rsp) # xx
+
+ movhlps %xmm10,%xmm10
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+
+ movsd %xmm2,%xmm13
+ mulsd %xmm0,%xmm13 # low x3 for sin
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm2,%xmm2 # high x4 for cos
+ movsd %xmm13,%xmm2 # low x3 for sin
+
+
+	movhlps	%xmm0,%xmm9					# upper x for cos term ; note using odd reg
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term
+ mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm5 # x3 * zs
+	mulpd	%xmm2,%xmm4					# lower=x3 * zs
+								# upper=x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+ subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+ addsd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm8 # + t
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+								# note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+	movhlps	%xmm4,%xmm8					# xmm8= sin, xmm4= cos
+
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+
+ addsd %xmm0,%xmm8 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm4 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_cos_cleanup
+
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ movapd p_temp2(%rsp),%xmm10 # x2
+ movapd p_temp3(%rsp),%xmm11 # x2
+
+ mulpd %xmm0,%xmm10 # x3
+ mulpd %xmm1,%xmm11 # x3
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ subpd %xmm2,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # -0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # +xx
+ addpd %xmm7,%xmm5 # +xx
+
+ addpd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrd4_cos_cleanup
diff --git a/src/gas/vrdaexp.S b/src/gas/vrdaexp.S
new file mode 100644
index 0000000..1ee640e
--- /dev/null
+++ b/src/gas/vrdaexp.S
@@ -0,0 +1,619 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdaexp.asm
+#
+# An array implementation of the exp libm function.
+#
+# Prototype:
+#
+# void vrda_exp(int n, double *x, double *y);
+#
+# Computes e raised to the x power for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking. Denormal results are truncated to 0.
+#
+#
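+#
+# As an illustrative scalar reference (a C sketch under assumptions, not the
+# two-at-a-time SSE2 code below), the reduction/polynomial scheme is roughly
+# the following; exp_sketch and its local names are hypothetical:
+#
+#   #include <math.h>
+#
+#   /* exp(x) ~= 2^m * 2^(j/32) * (1 + q), where n = nearest int to x*32/ln2,
+#      j = n & 31, m = (n - j)/32.  The asm reads 2^(j/32) from the lead/trail
+#      tables at the end of this file and uses split lead/tail log2/32
+#      constants for the subtraction. */
+#   double exp_sketch(double x)
+#   {
+#       const double ln2 = 0.6931471805599453;
+#       double r = x * (32.0 / ln2);
+#       int n = (int)lrint(r);                 /* nearest integer */
+#       int j = n & 0x1f;
+#       int m = (n - j) / 32;
+#       double r1 = x - n * (ln2 / 32.0);
+#       double q  = r1 + r1*r1*(1.0/2 + r1*(1.0/6 + r1*(1.0/24
+#                      + r1*(1.0/120 + r1*(1.0/720)))));
+#       double f  = pow(2.0, (double)j / 32.0);
+#       return ldexp(f + f*q, m);              /* scale by 2^m */
+#   }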
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for exponent multiply
+
+.equ save_xa,0x020 #qword
+.equ save_ya,0x028 #qword
+.equ save_nv,0x030 #qword
+
+.equ p_iter,0x038 # qword storage for number of loop iterations
+
+.equ p2_temp,0x40 # second temporary for get/put bits operation
+ # large enough for two vectors
+.equ p2_temp1,0x60 # second temporary for exponent multiply
+ # large enough for two vectors
+.equ save_rbx,0x080 #qword
+
+.equ stack_size,0x088
+
+ .weak vrda_exp_
+ .set vrda_exp_,__vrda_exp__
+ .weak vrda_exp__
+ .set vrda_exp__,__vrda_exp__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array exp
+#** VRDA_EXP(N,X,Y)
+# C equivalent*/
+#void vrda_exp__(int * n, double *x, double *y)
+#{
+# vrda_exp(*n,x,y);
+#}
+.globl __vrda_exp__
+ .type __vrda_exp__,@function
+__vrda_exp__:
+ mov (%rdi),%edi
+
+
+ .align 16
+ .p2align 4,,15
+
+
+# parameters are passed in by gcc as:
+# edi - int n
+# rsi - double *x
+# rdx - double *y
+
+
+.globl vrda_exp
+ .type vrda_exp,@function
+vrda_exp:
+
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 4 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ movapd .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+# compute the exponents
+
+# Step 1. Reduce the argument.
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# r = x * thirtytwo_by_logbaseof2;
+ movapd %xmm3,%xmm7
+ movapd %xmm0,p_temp(%rsp)
+ maxpd .L__real_C0F0000000000000(%rip),%xmm0 # protect against very large negative, non-infinite numbers
+ mulpd %xmm0,%xmm3
+
+ movlpd -16(%rsi),%xmm6
+ movhpd -8(%rsi),%xmm6
+ movapd %xmm6,p2_temp(%rsp)
+ maxpd .L__real_C0F0000000000000(%rip),%xmm6
+ mulpd %xmm6,%xmm7
+
+# save x for later.
+ minpd .L__real_40F0000000000000(%rip),%xmm3 # protect against very large, non-infinite numbers
+
+# /* Set n = nearest integer to r */
+ cvtpd2dq %xmm3,%xmm4
+ lea .L__two_to_jby32_lead_table(%rip),%rdi
+ lea .L__two_to_jby32_trail_table(%rip),%rsi
+ cvtdq2pd %xmm4,%xmm1
+ minpd .L__real_40F0000000000000(%rip),%xmm7 # protect against very large, non-infinite numbers
+
+ # r1 = x - n * logbaseof2_by_32_lead;
+ movapd .L__real_log2_by_32_lead(%rip),%xmm2 #
+ mulpd %xmm1,%xmm2 #
+ movq %xmm4,p_temp1(%rsp)
+ subpd %xmm2,%xmm0 # r1 in xmm0,
+
+ cvtpd2dq %xmm7,%xmm2
+ cvtdq2pd %xmm2,%xmm8
+
+# r2 = - n * logbaseof2_by_32_trail;
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm1 # r2 in xmm1
+# j = n & 0x0000001f;
+ mov $0x01f,%r9
+ mov %r9,%r8
+ mov p_temp1(%rsp),%ecx
+ and %ecx,%r9d
+ movq %xmm2,p2_temp1(%rsp)
+ movapd .L__real_log2_by_32_lead(%rip),%xmm9
+ mulpd %xmm8,%xmm9
+ subpd %xmm9,%xmm6 # r1b in xmm6
+ mulpd .L__real_log2_by_32_tail(%rip),%xmm8 # r2b in xmm8
+
+ mov p_temp1+4(%rsp),%edx
+ and %edx,%r8d
+# f1 = two_to_jby32_lead_table[j];
+# f2 = two_to_jby32_trail_table[j];
+
+# *m = (n - j) / 32;
+ sub %r9d,%ecx
+ sar $5,%ecx #m
+ sub %r8d,%edx
+ sar $5,%edx
+
+
+ movapd %xmm0,%xmm2
+ addpd %xmm1,%xmm2 # r = r1 + r2
+
+ mov $0x01f,%r11
+ mov %r11,%r10
+ mov p2_temp1(%rsp),%ebx
+ and %ebx,%r11d
+# Step 2. Compute the polynomial.
+# q = r1 + (r2 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 ))))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+ movapd %xmm2,%xmm1
+ movapd .L__real_3f56c1728d739765(%rip),%xmm3 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm0 # 1/6
+# deal with infinite results
+ mov $1024,%rax
+ movsx %ecx,%rcx
+ cmp %rax,%rcx
+
+ mulpd %xmm2,%xmm3 # *x
+ mulpd %xmm2,%xmm0 # *x
+ mulpd %xmm2,%xmm1 # x*x
+ movapd %xmm1,%xmm4
+
+ cmovg %rax,%rcx ## if infinite, then set rcx to multiply
+ # by infinity
+ movsx %edx,%rdx
+ cmp %rax,%rdx
+
+ movapd %xmm6,%xmm9
+ addpd %xmm8,%xmm9 # rb = r1b + r2b
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm3 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm0 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm2,%xmm3 # *x
+
+	cmovg	%rax,%rdx		## if infinite, then set rdx to multiply
+ # by infinity
+# deal with denormal results
+ xor %rax,%rax
+ add $1023,%rcx # add bias
+
+ mulpd %xmm1,%xmm0 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm3 # + 1/24
+ addpd %xmm2,%xmm0 # + x
+ mulpd %xmm4,%xmm3 # *x^4
+
+# check for infinity or nan
+ movapd p_temp(%rsp),%xmm2
+
+ cmovs %rax,%rcx ## if denormal, then multiply by 0
+ shl $52,%rcx # build 2^n
+
+ sub %r11d,%ebx
+ movapd %xmm9,%xmm1
+ addpd %xmm3,%xmm0 # q = final sum
+ movapd .L__real_3f56c1728d739765(%rip),%xmm7 # 1/720
+ movapd .L__real_3FC5555555548F7C(%rip),%xmm3 # 1/6
+
+# *z2 = f2 + ((f1 + f2) * q);
+ movlpd (%rsi,%r9,8),%xmm5 # f2
+ movlpd (%rsi,%r8,8),%xmm4 # f2
+ addsd (%rdi,%r8,8),%xmm4 # f1 + f2
+ addsd (%rdi,%r9,8),%xmm5 # f1 + f2
+ mov p2_temp1+4(%rsp),%r8d
+ and %r8d,%r10d
+ sar $5,%ebx #m
+ mulpd %xmm9,%xmm7 # *x
+ mulpd %xmm9,%xmm3 # *x
+ mulpd %xmm9,%xmm1 # x*x
+ sub %r10d,%r8d
+ sar $5,%r8d
+# check for infinity or nan
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ add $1023,%rdx # add bias
+ shufpd $0,%xmm4,%xmm5
+ movapd %xmm1,%xmm4
+
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ shl $52,%rdx # build 2^n
+
+ mulpd %xmm5,%xmm0
+ mov %rcx,p_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p_temp1+8(%rsp) # get 2^n to memory
+ addpd %xmm5,%xmm0 #z = z1 + z2 done with 1,2,3,4,5
+ mov $1024,%rax
+ movsx %ebx,%rbx
+ cmp %rax,%rbx
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# r = scaleDouble_1(z, n);
+
+
+	cmovg	%rax,%rbx		## if infinite, then set rbx to multiply
+ # by infinity
+ movsx %r8d,%rdx
+ cmp %rax,%rdx
+
+ movmskpd %xmm2,%r8d
+
+ addpd .L__real_3F811115B7AA905E(%rip),%xmm7 # + 1/120
+ addpd .L__real_3fe0000000000000(%rip),%xmm3 # + .5
+ mulpd %xmm1,%xmm4 # x^4
+ mulpd %xmm9,%xmm7 # *x
+	cmovg	%rax,%rdx		## if infinite, then set rdx to multiply by infinity
+
+
+ xor %rax,%rax
+ add $1023,%rbx # add bias
+
+ mulpd %xmm1,%xmm3 # *x^2
+ addpd .L__real_3FA5555555545D4E(%rip),%xmm7 # + 1/24
+ addpd %xmm9,%xmm3 # + x
+ mulpd %xmm4,%xmm7 # *x^4
+
+ cmovs %rax,%rbx ## if denormal, then multiply by 0
+ shl $52,%rbx # build 2^n
+
+# Step 3. Reconstitute.
+
+ mulpd p_temp1(%rsp),%xmm0 # result *= 2^n
+ addpd %xmm7,%xmm3 # q = final sum
+
+ movlpd (%rsi,%r11,8),%xmm5 # f2
+ movlpd (%rsi,%r10,8),%xmm4 # f2
+ addsd (%rdi,%r10,8),%xmm4 # f1 + f2
+ addsd (%rdi,%r11,8),%xmm5 # f1 + f2
+
+ add $1023,%rdx # add bias
+ cmovs %rax,%rdx ## if denormal, then multiply by 0
+ shufpd $0,%xmm4,%xmm5
+ shl $52,%rdx # build 2^n
+
+ mulpd %xmm5,%xmm3
+ mov %rbx,p2_temp1(%rsp) # get 2^n to memory
+ mov %rdx,p2_temp1+8(%rsp) # get 2^n to memory
+ addpd %xmm5,%xmm3 #z = z1 + z2
+
+ movapd p2_temp(%rsp),%xmm2
+ andpd .L__real_infinity(%rip),%xmm2
+ cmppd $0,.L__real_infinity(%rip),%xmm2
+ movmskpd %xmm2,%ebx
+ test $3,%r8d
+ mulpd p2_temp1(%rsp),%xmm3 # result *= 2^n
+# we'd like to avoid a branch, and could use cmp's and and's to
+# eliminate it. But that adds cycles to the normal cases just to
+# handle what should be rare exceptions. Using this branch with the
+# check above results in faster code for the normal cases.
+ jnz .L__exp_naninf
+
+.L__vda_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+ test $3,%ebx
+ jnz .L__exp_naninf2
+
+.L__vda_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm3,-16(%rdi)
+ movhpd %xmm3,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+#
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+ lea p_temp(%rsp),%rcx
+ call .L__naninf
+ jmp .L__vda_bottom1
+.L__exp_naninf2:
+ lea p2_temp(%rsp),%rcx
+ mov %ebx,%r8d
+ movapd %xmm3,%xmm0
+ call .L__naninf
+ movapd %xmm0,%xmm3
+ jmp .L__vda_bottom2
+
+# This subroutine checks a double pair for nans and infinities and
+# produces the proper result from the exceptional inputs
+# Register assumptions:
+# Inputs:
+# r8d - mask of errors
+# xmm0 - computed result vector
+# rcx - pointing to memory image of inputs
+# Outputs:
+# xmm0 - new result vector
+#	%rax, %rdx, %xmm2 all modified.
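+#
+# Sketched in C, the per-element policy applied here is roughly the following
+# (exp_fixup_sketch is a hypothetical helper; the caller has already flagged
+# the element as inf or NaN):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#
+#   static double exp_fixup_sketch(double x)       /* x is +-inf or a NaN */
+#   {
+#       uint64_t bits;
+#       memcpy(&bits, &x, sizeof bits);
+#       if (bits & 0x000FFFFFFFFFFFFFULL) {        /* mantissa non-zero: NaN */
+#           bits |= 0x0008000000000000ULL;         /* convert to quiet NaN */
+#           memcpy(&x, &bits, sizeof x);
+#           return x;
+#       }
+#       return (bits >> 63) ? 0.0 : x;             /* exp(-inf)=0, exp(+inf)=inf */
+#   }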
+.L__naninf:
+# check the first number
+ test $1,%r8d
+ jz .L__check2
+
+ mov (%rcx),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan1 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r1 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r1
+
+#NaN
+.L__enan1:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r1:
+ movd %rdx,%xmm2
+ shufpd $2,%xmm0,%xmm2
+ movsd %xmm2,%xmm0
+# check the second number
+.L__check2:
+ test $2,%r8d
+ jz .L__r3
+ mov 8(%rcx),%rdx
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__enan2 # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %rdx,%rax
+ rcl $1,%rax
+ jnc .L__r2 # exp(+inf) = inf
+ xor %rdx,%rdx # exp(-inf) = 0
+ jmp .L__r2
+
+#NaN
+.L__enan2:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__r2:
+ movd %rdx,%xmm2
+ shufpd $0,%xmm2,%xmm0
+.L__r3:
+ ret
+
+ .align 16
+# we jump here when we have an odd number of exp calls to make at the
+# end
+# we assume that rdx is pointing at the next x array element,
+# r8 at the next y array element. The number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p2_temp+8(%rsp)
+ movapd %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L_vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p2_temp+8(%rsp)
+ cmp $3,%rax
+ jl .L_vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p2_temp+16(%rsp)
+
+.L_vdacg:
+ mov $4,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+ call vrda_exp@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L_vdacgf
+
+ mov p2_temp1+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L_vdacgf
+
+ mov p2_temp1+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L_vdacgf:
+ jmp .L__final_check
+
+ .data
+ .align 64
+
+
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000 # for alignment
+.L__real_4040000000000000: .quad 0x04040000000000000 # 32
+ .quad 0x04040000000000000
+.L__real_40F0000000000000: .quad 0x040F0000000000000 # 65536, to protect against really large numbers
+ .quad 0x040F0000000000000
+.L__real_C0F0000000000000: .quad 0x0C0F0000000000000 # -65536, to protect against really large negative numbers
+ .quad 0x0C0F0000000000000
+.L__real_3FA0000000000000: .quad 0x03FA0000000000000 # 1/32
+ .quad 0x03FA0000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_infinity: .quad 0x07ff0000000000000 #
+ .quad 0x07ff0000000000000 # for alignment
+.L__real_ninfinity: .quad 0x0fff0000000000000 #
+ .quad 0x0fff0000000000000 # for alignment
+.L__real_thirtytwo_by_log2: .quad 0x040471547652b82fe # thirtytwo_by_log2
+ .quad 0x040471547652b82fe
+.L__real_log2_by_32_lead: .quad 0x03f962e42fe000000 # log2_by_32_lead
+ .quad 0x03f962e42fe000000
+.L__real_log2_by_32_tail: .quad 0x0Bdcf473de6af278e # -log2_by_32_tail
+ .quad 0x0Bdcf473de6af278e
+.L__real_3f56c1728d739765: .quad 0x03f56c1728d739765 # 1.38889490863777199667e-03
+ .quad 0x03f56c1728d739765
+.L__real_3F811115B7AA905E: .quad 0x03F811115B7AA905E # 8.33336798434219616221e-03
+ .quad 0x03F811115B7AA905E
+.L__real_3FA5555555545D4E: .quad 0x03FA5555555545D4E # 4.16666666662260795726e-02
+ .quad 0x03FA5555555545D4E
+.L__real_3FC5555555548F7C: .quad 0x03FC5555555548F7C # 1.66666666665260878863e-01
+ .quad 0x03FC5555555548F7C
+
+
+.L__two_to_jby32_lead_table:
+ .quad 0x03ff0000000000000 # 1
+ .quad 0x03ff059b0d0000000 # 1.0219
+ .quad 0x03ff0b55860000000 # 1.04427
+ .quad 0x03ff11301d0000000 # 1.06714
+ .quad 0x03ff172b830000000 # 1.09051
+ .quad 0x03ff1d48730000000 # 1.11439
+ .quad 0x03ff2387a60000000 # 1.13879
+ .quad 0x03ff29e9df0000000 # 1.16372
+ .quad 0x03ff306fe00000000 # 1.18921
+ .quad 0x03ff371a730000000 # 1.21525
+ .quad 0x03ff3dea640000000 # 1.24186
+ .quad 0x03ff44e0860000000 # 1.26905
+ .quad 0x03ff4bfdad0000000 # 1.29684
+ .quad 0x03ff5342b50000000 # 1.32524
+ .quad 0x03ff5ab07d0000000 # 1.35426
+ .quad 0x03ff6247eb0000000 # 1.38391
+ .quad 0x03ff6a09e60000000 # 1.41421
+ .quad 0x03ff71f75e0000000 # 1.44518
+ .quad 0x03ff7a11470000000 # 1.47683
+ .quad 0x03ff8258990000000 # 1.50916
+ .quad 0x03ff8ace540000000 # 1.54221
+ .quad 0x03ff93737b0000000 # 1.57598
+ .quad 0x03ff9c49180000000 # 1.61049
+ .quad 0x03ffa5503b0000000 # 1.64576
+ .quad 0x03ffae89f90000000 # 1.68179
+ .quad 0x03ffb7f76f0000000 # 1.71862
+ .quad 0x03ffc199bd0000000 # 1.75625
+ .quad 0x03ffcb720d0000000 # 1.79471
+ .quad 0x03ffd5818d0000000 # 1.83401
+ .quad 0x03ffdfc9730000000 # 1.87417
+ .quad 0x03ffea4afa0000000 # 1.91521
+ .quad 0x03fff507650000000 # 1.95714
+ .quad 0 # for alignment
+.L__two_to_jby32_trail_table:
+ .quad 0x00000000000000000 # 0
+ .quad 0x03e48ac2ba1d73e2a # 1.1489e-008
+ .quad 0x03e69f3121ec53172 # 4.83347e-008
+ .quad 0x03df25b50a4ebbf1b # 2.67125e-010
+ .quad 0x03e68faa2f5b9bef9 # 4.65271e-008
+ .quad 0x03e368b9aa7805b80 # 5.24924e-009
+ .quad 0x03e6ceac470cd83f6 # 5.38622e-008
+ .quad 0x03e547f7b84b09745 # 1.90902e-008
+ .quad 0x03e64636e2a5bd1ab # 3.79764e-008
+ .quad 0x03e5ceaa72a9c5154 # 2.69307e-008
+ .quad 0x03e682468446b6824 # 4.49684e-008
+ .quad 0x03e18624b40c4dbd0 # 1.41933e-009
+ .quad 0x03e54d8a89c750e5e # 1.94147e-008
+ .quad 0x03e5a753e077c2a0f # 2.46409e-008
+ .quad 0x03e6a90a852b19260 # 4.94813e-008
+ .quad 0x03e0d2ac258f87d03 # 8.48872e-010
+ .quad 0x03e59fcef32422cbf # 2.42032e-008
+ .quad 0x03e61d8bee7ba46e2 # 3.3242e-008
+ .quad 0x03e4f580c36bea881 # 1.45957e-008
+ .quad 0x03e62999c25159f11 # 3.46453e-008
+ .quad 0x03e415506dadd3e2a # 8.0709e-009
+ .quad 0x03e29b8bc9e8a0388 # 2.99439e-009
+ .quad 0x03e451f8480e3e236 # 9.83622e-009
+ .quad 0x03e41f12ae45a1224 # 8.35492e-009
+ .quad 0x03e62b5a75abd0e6a # 3.48493e-008
+ .quad 0x03e47daf237553d84 # 1.11085e-008
+ .quad 0x03e6b0aa538444196 # 5.03689e-008
+ .quad 0x03e69df20d22a0798 # 4.81896e-008
+ .quad 0x03e69f7490e4bb40b # 4.83654e-008
+ .quad 0x03e4bdcdaf5cb4656 # 1.29746e-008
+ .quad 0x03e452486cc2c7b9d # 9.84533e-009
+ .quad 0x03e66dc8a80ce9f09 # 4.25828e-008
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrdalog.S b/src/gas/vrdalog.S
new file mode 100644
index 0000000..cdbba18
--- /dev/null
+++ b/src/gas/vrdalog.S
@@ -0,0 +1,954 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalog.s
+#
+# An array implementation of the log libm function.
+#
+# Prototype:
+#
+# void vrda_log(int n, double *x, double *y);
+#
+# Computes the natural log of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute logs in 44
+# cycles with n <= 24
+#
+#
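+#
+# As an illustrative scalar reference (a C sketch under assumptions, not the
+# two-at-a-time SSE2 code below), the table-driven scheme for a positive
+# normal x is roughly the following; log_sketch and its locals are
+# hypothetical names:
+#
+#   #include <math.h>
+#
+#   /* x = f * 2^xexp with f in [0.5,1); f1 = index/128 is a nearby table
+#      point and log(f/f1) is an odd series in u = (f-f1)/(f1 + 0.5*(f-f1)).
+#      The asm takes log(f1) from its lead/tail tables (with slightly
+#      different exponent/table bookkeeping) and splits log(2) likewise. */
+#   double log_sketch(double x)
+#   {
+#       const double ln2 = 0.6931471805599453;
+#       int xexp;
+#       double f  = frexp(x, &xexp);            /* f in [0.5, 1) */
+#       int index = (int)(f * 128.0 + 0.5);     /* asm builds this from mantissa bits */
+#       double f1 = index / 128.0;
+#       double f2 = f - f1;
+#       double u  = f2 / (f1 + 0.5 * f2);
+#       double v  = u * u;
+#       double poly = u + u*v*(1.0/12 + v*(1.0/80 + v*(1.0/448)));   /* ~ cb1..cb3 */
+#       return xexp * ln2 + log(f1) + poly;
+#   }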
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+ .weak vrda_log_
+ .set vrda_log_,__vrda_log__
+ .weak vrda_log__
+ .set vrda_log__,__vrda_log__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log
+#** VRDA_LOG(N,X,Y)
+# C equivalent*/
+#void vrda_log__(int * n, double *x, double *y)
+#{
+# vrda_log(*n,x,y);
+#}
+.globl __vrda_log__
+ .type __vrda_log__,@function
+__vrda_log__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_log
+ .type vrda_log,@function
+vrda_log:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm7
+ movhpd -8(%rsi),%xmm7
+
+# compute the logs
+
+## if NaN or inf
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
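+#
+# Sketched in C for a positive normal input (split_sketch is a hypothetical
+# helper; it mirrors the psrlq/psubq/por sequence used below):
+#
+#   #include <stdint.h>
+#   #include <string.h>
+#
+#   static void split_sketch(double x, double *xexp, double *f)
+#   {
+#       uint64_t bits;
+#       memcpy(&bits, &x, sizeof bits);
+#       *xexp = (double)((int64_t)(bits >> 52) - 1023);    /* psrlq $52 ; psubq 1023 */
+#       uint64_t fbits = (bits & 0x000FFFFFFFFFFFFFULL)    /* .L__real_mant */
+#                      | 0x3FE0000000000000ULL;            /* exponent of .L__real_half */
+#       memcpy(f, &fbits, sizeof *f);                      /* f in [0.5, 1) */
+#   }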
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm7,p_x2(%rsp) # save the input values
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ movapd p_xexp(%rsp),%xmm5 # xexp
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ mulpd %xmm5,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+
+ addpd %xmm5,%xmm1 #r2
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__vlog2:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+
+ movapd p_xexp2(%rsp),%xmm5 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ mulpd %xmm5,%xmm4
+ addpd %xmm4,%xmm7 #r1
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ addpd %xmm5,%xmm9 #r2
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm9,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+
+#__vda_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm7,-16(%rdi)
+ movhpd %xmm7,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addpd %xmm2,%xmm0
+ ret
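+
+# The near-one computation above, consolidated into scalar C (a sketch; the
+# ca1..ca4 values are the .L__real_ca* constants from the data section):
+#
+#   static double log_near_one_sketch(double x)    /* |x - 1| below threshold */
+#   {
+#       const double ca1 = 8.33333333333317923934e-02;
+#       const double ca2 = 1.25000000037717509602e-02;
+#       const double ca3 = 2.23213998791944806202e-03;
+#       const double ca4 = 4.34887777707614552256e-04;
+#       double r = x - 1.0;
+#       double u = r / (2.0 + r);
+#       double correction = r * u;
+#       u = u + u;
+#       double v = u * u;
+#       double r2 = u * v * (ca1 + v * (ca2 + v * (ca3 + v * ca4))) - correction;
+#       return r + r2;
+#   }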
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,(%rdi)
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,8(%rdi)
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ test $1,%r9d
+ jz .L__lnn22
+
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movsd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlhps %xmm0,%xmm7
+
+.L__lnn2e:
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# return r + r2;
+ addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz	.L__zn_x		# non-zero after dropping the sign bit, so x is a negative number
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
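+# Combined special-case policy of .L__lni/.L__zni above, sketched in C
+# (log_special_sketch is a hypothetical helper; the callers have already
+# flagged the element as inf/NaN or as zero/negative):
+#
+#   #include <math.h>
+#   #include <stdint.h>
+#   #include <string.h>
+#
+#   static double log_special_sketch(double x)
+#   {
+#       uint64_t bits;
+#       memcpy(&bits, &x, sizeof bits);
+#       if ((bits & 0x7FF0000000000000ULL) == 0x7FF0000000000000ULL) {
+#           if (bits & 0x000FFFFFFFFFFFFFULL) {    /* NaN: convert to quiet */
+#               bits |= 0x0008000000000000ULL;
+#               memcpy(&x, &bits, sizeof x);
+#               return x;
+#           }
+#           return (bits >> 63) ? NAN : x;         /* log(-inf)=NaN, log(+inf)=inf */
+#       }
+#       if ((bits << 1) == 0)
+#           return -INFINITY;                      /* log(+-0) = -inf (C99) */
+#       return NAN;                                /* negative x -> NaN */
+#   }
+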
+
+# we jump here when we have an odd number of log calls to make at the
+# end
+# we assume that rdx is pointing at the next x array element,
+# r8 at the next y array element. The number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ call vrda_log@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrdalog10.S b/src/gas/vrdalog10.S
new file mode 100644
index 0000000..f766b62
--- /dev/null
+++ b/src/gas/vrdalog10.S
@@ -0,0 +1,1021 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalog10.s
+#
+# An array implementation of the log10 libm function.
+#
+# Prototype:
+#
+# void vrda_log10(int n, double *x, double *y);
+#
+# Computes the base-10 logarithm (log10) of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute log10s in 50-55
+# cycles with n <= 24
+#
+#
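+# A minimal C-level usage sketch (illustrative only; the array and length
+# values below are hypothetical, not part of this file):
+#
+#   double x[8], y[8];
+#   /* ... fill x[] with positive finite values ... */
+#   vrda_log10(8, x, y);   /* y[i] = log10(x[i]) for i = 0..7 */
+#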
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrda_log10_
+ .set vrda_log10_,__vrda_log10__
+ .weak vrda_log10__
+ .set vrda_log10__,__vrda_log10__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log10
+#** VRDA_LOG10(N,X,Y)
+# C equivalent*/
+#void vrda_log10__(int * n, double *x, double *y)
+#{
+# vrda_log10(*n,x,y);
+#}
+.globl __vrda_log10__
+ .type __vrda_log10__,@function
+__vrda_log10__:
+ mov (%rdi),%edi
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl vrda_log10
+ .type vrda_log10,@function
+vrda_log10:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm7
+ movhpd -8(%rsi),%xmm7
+
+# compute the log10s
+
+## if NaN or inf
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
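+# rough C equivalent of the exponent extraction above (a sketch, valid for
+# positive normal x; zero, NaN, and infinity inputs are handled later, and
+# bits(x) denotes the raw IEEE-754 encoding of x):
+#   xexp = (double)((int)(bits(x) >> 52) - 1023);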
+ movdqa %xmm7,p_x2(%rsp) # save the input values
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
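+# in C terms (sketch): with m7 = the top 7 bits of the mantissa,
+#   index = 64 + (m7 >> 1) + (m7 & 1);   /* 64 <= index <= 128 */
+# so f1 = index/128 approximates f to within 1/256 and selects the table
+# entry used below.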
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log10 tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+
+ movapd p_xexp(%rsp),%xmm5 # xexp
+ addpd %xmm2,%xmm1 # poly
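+# in effect (sketch): poly = u + cb1*u^3 + u^5*(cb2 + cb3*u^2), the leading
+# terms of the series for 2*atanh(u/2) = ln(f/f1), since u = f2/(f1 + f2/2)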
+# recombine
+ mulpd %xmm5,%xmm4 # xexp * log2_lead
+ addpd %xmm4,%xmm0 #r1
+ movapd %xmm0,%xmm2 #for log10
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm4 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm4 #z2 +=q
+ mulpd .L__real_log10e_tail(%rip),%xmm0 #for log10
+ mulpd .L__real_log10e_lead(%rip),%xmm2 #for log10
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm4,%xmm1
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+
+ addpd %xmm5,%xmm1 #r2
+ movapd %xmm1,%xmm7 #for log10
+ mulpd .L__real_log10e_tail(%rip),%xmm1 #for log10
+ addpd %xmm1,%xmm0 #for log10
+
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ mulpd .L__real_log10e_lead(%rip),%xmm7 #log10
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ addpd %xmm7,%xmm0 #for log10
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm2,%xmm0 #for log10
+# addpd %xmm1,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__vlog2:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+
+ movapd p_xexp2(%rsp),%xmm5 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ mulpd %xmm5,%xmm4
+ addpd %xmm4,%xmm7 #r1
+ movapd %xmm7,%xmm6 #for log10
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mulpd .L__real_log10e_tail(%rip),%xmm7 #for log10
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ mulpd .L__real_log10e_lead(%rip),%xmm6 #for log10
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9
+
+ mulpd .L__real_log2_tail(%rip),%xmm5
+
+ addpd %xmm5,%xmm9 #r2
+ movapd %xmm9,%xmm8 #for log10
+	mulpd	.L__real_log10e_tail(%rip),%xmm9	#for log10
+ addpd %xmm9,%xmm7 #for log10
+ mulpd .L__real_log10e_lead(%rip),%xmm8 #for log10
+ addpd %xmm8,%xmm7 #for log10
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm6,%xmm7 #for log10
+# addpd %xmm9,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+
+#__vda_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm7,-16(%rdi)
+ movhpd %xmm7,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log10e_tail(%rip),%xmm2
+ mulpd .L__real_log10e_tail(%rip),%xmm0
+ mulpd .L__real_log10e_lead(%rip),%xmm1
+ mulpd .L__real_log10e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
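+# in effect (sketch): the natural-log result r + r2 is scaled by
+# log10(e) = log10e_lead + log10e_tail, with r split into a high part r1
+# (upper 32 bits) and the remainder folded into r2; the four partial
+# products are summed smallest-first to preserve the extra precision.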
+
+# return r + r2;
+# addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,(%rdi)
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,8(%rdi)
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ test $1,%r9d
+ jz .L__lnn22
+
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movsd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlhps %xmm0,%xmm7
+
+.L__lnn2e:
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log10
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log10e_tail(%rip),%xmm2
+ mulsd .L__real_log10e_tail(%rip),%xmm0
+ mulsd .L__real_log10e_lead(%rip),%xmm1
+ mulsd .L__real_log10e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
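+# rough C equivalent (sketch; mantissa_bits and quiet_nan_of are
+# illustrative helper names, not routines defined here):
+#   if (mantissa_bits(x) != 0) return quiet_nan_of(x); /* NaN in, quiet NaN out */
+#   if (!signbit(x))           return x;               /* log10(+inf) = +inf    */
+#   return NaN;                                        /* log10(-inf) is invalid */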
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
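+# rough C equivalent (sketch):
+#   if (x == 0.0) return -INFINITY;   /* C99: log10(+-0) = -inf  */
+#   return NaN;                       /* x < 0 is a domain error */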
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		# nonzero after shifting out the sign bit => x is negative, not +-0
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+# we jump here when there are leftover values (fewer than a full group of
+# four) to process at the end.  save_xa points at the next x array element,
+# save_ya at the next y array element, and the number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ call vrda_log10@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03FB082C000000000 # .064495086669921875 Threshold
+ .quad 0x03FB082C000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+
+.L__real_log10e_lead: .quad 0x03fdbcb7800000000 # log10e_lead 4.34293746948242187500e-01
+ .quad 0x03fdbcb7800000000
+.L__real_log10e_tail: .quad 0x03ea8a93728719535 # log10e_tail 7.3495500964015109100644e-7
+ .quad 0x03ea8a93728719535
+
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrdalog2.S b/src/gas/vrdalog2.S
new file mode 100644
index 0000000..0200f03
--- /dev/null
+++ b/src/gas/vrdalog2.S
@@ -0,0 +1,1003 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalog2.s
+#
+# An array implementation of the log2 libm function.
+#
+# Prototype:
+#
+# void vrda_log2(int n, double *x, double *y);
+#
+# Computes the base-2 logarithm (log2) of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error. This version can compute log2s in 44
+# cycles with n <= 24
+#
+#
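+# Usage mirrors vrda_log10 (illustrative sketch):
+#   vrda_log2(n, x, y);   /* y[i] = log2(x[i]) for i = 0..n-1 */
+#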
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+.equ p_idx,0x010 # index storage
+.equ p_xexp,0x020 # index storage
+
+.equ p_x2,0x030 # temporary for error checking operation
+.equ p_idx2,0x040 # index storage
+.equ p_xexp2,0x050 # index storage
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+
+
+.equ p2_temp,0x090 # second temporary for get/put bits operation
+.equ p2_temp1,0x0b0 # second temporary for exponent multiply
+
+.equ p_n1,0x0c0 # temporary for near one check
+.equ p_n12,0x0d0 # temporary for near one check
+
+
+.equ stack_size,0x0e8
+
+ .weak vrda_log2_
+ .set vrda_log2_,__vrda_log2__
+ .weak vrda_log2__
+ .set vrda_log2__,__vrda_log2__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log2
+#** VRDA_LOG2(N,X,Y)
+# C equivalent*/
+#void vrda_log2__(int * n, double *x, double *y)
+#{
+# vrda_log2(*n,x,y);
+#}
+.globl __vrda_log2__
+ .type __vrda_log2__,@function
+__vrda_log2__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_log2
+ .type vrda_log2,@function
+vrda_log2:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 2 values at a time.
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm7
+ movhpd -8(%rsi),%xmm7
+
+# compute the logs
+
+## if NaN or inf
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+# /* Store the exponent of x in xexp and put
+# f into the range [0.5,1) */
+
+ pxor %xmm1,%xmm1
+ movdqa %xmm0,%xmm3
+ psrlq $52,%xmm3
+ psubq .L__mask_1023(%rip),%xmm3
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm6 # xexp
+ movdqa %xmm7,p_x2(%rsp) # save the input values
+ movdqa %xmm0,%xmm2
+ subpd .L__real_one(%rip),%xmm2
+
+ movapd %xmm6,p_xexp(%rsp)
+ andpd .L__real_notsign(%rip),%xmm2
+ xor %rax,%rax
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+
+ cmppd $1,.L__real_threshold(%rip),%xmm2
+ movmskpd %xmm2,%ecx
+ movdqa %xmm3,%xmm4
+ mov %ecx,p_n1(%rsp)
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrlq $45,%xmm3
+ movdqa %xmm3,%xmm2
+ psrlq $1,%xmm3
+ paddq .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm2
+ paddq %xmm2,%xmm3
+
+ packssdw %xmm1,%xmm3
+ cvtdq2pd %xmm3,%xmm1
+ pxor %xmm7,%xmm7
+ movdqa p_x2(%rsp),%xmm2
+ movapd p_x2(%rsp),%xmm5
+ psrlq $52,%xmm2
+ psubq .L__mask_1023(%rip),%xmm2
+ packssdw %xmm7,%xmm2
+ subpd .L__real_one(%rip),%xmm5
+ andpd .L__real_notsign(%rip),%xmm5
+ cvtdq2pd %xmm2,%xmm6 # xexp
+ xor %rcx,%rcx
+ cmppd $1,.L__real_threshold(%rip),%xmm5
+ movq %xmm3,p_idx(%rsp)
+
+# reduce and get u
+ por .L__real_half(%rip),%xmm4
+ movdqa %xmm4,%xmm2
+ movapd %xmm6,p_xexp2(%rsp)
+
+ # do near one check
+ movmskpd %xmm5,%edx
+ mov %edx,p_n12(%rsp)
+
+ mulpd .L__real_3f80000000000000(%rip),%xmm1 # f1 = index/128
+
+
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx(%rsp),%eax
+ movdqa p_x2(%rsp),%xmm6
+
+ movapd .L__real_half(%rip),%xmm5 # .5
+ subpd %xmm1,%xmm2 # f2 = f - f1
+ pand .L__real_mant(%rip),%xmm6
+ mulpd %xmm2,%xmm5
+ addpd %xmm5,%xmm1
+
+ movdqa %xmm6,%xmm8
+ psrlq $45,%xmm6
+ movdqa %xmm6,%xmm4
+
+ psrlq $1,%xmm6
+ paddq .L__mask_040(%rip),%xmm6
+ pand .L__mask_001(%rip),%xmm4
+ paddq %xmm4,%xmm6
+# do error checking here for scheduling. Saves a bunch of cycles as
+# compared to doing this at the start of the routine.
+## if NaN or inf
+ movapd %xmm0,%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r8d
+ packssdw %xmm7,%xmm6
+ por .L__real_half(%rip),%xmm8
+ movq %xmm6,p_idx2(%rsp)
+ cvtdq2pd %xmm6,%xmm9
+
+ cmppd $2,.L__real_zero(%rip),%xmm0
+ mulpd .L__real_3f80000000000000(%rip),%xmm9 # f1 = index/128
+ movmskpd %xmm0,%r9d
+# delaying this divide helps, but moving the other one does not.
+# it was after the paddq
+ divpd %xmm1,%xmm2 # u
+
+# compute the index into the log tables
+#
+
+ movlpd -512(%rdx,%rax,8),%xmm0 # z1
+ mov p_idx+4(%rsp),%ecx
+ movhpd -512(%rdx,%rcx,8),%xmm0 # z1
+# solve for ln(1+u)
+ movapd %xmm2,%xmm1 # u
+ mulpd %xmm2,%xmm2 # u^2
+ movapd %xmm2,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm2,%xmm3 #Cu2
+ mulpd %xmm1,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm2 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm1 # u+Au3
+ movapd %xmm0,%xmm5 #z1 copy
+ mulpd %xmm3,%xmm2 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+ movapd p_xexp(%rsp),%xmm6 # xexp
+ addpd %xmm2,%xmm1 # poly
+# recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%eax
+ mov p_idx2+4(%rsp),%ecx
+ addpd %xmm2,%xmm1 #z2
+ movapd %xmm1,%xmm2 #z2 copy
+
+
+ mulpd %xmm4,%xmm5
+ mulpd %xmm4,%xmm1
+ movapd .L__real_half(%rip),%xmm4 # .5
+ subpd %xmm9,%xmm8 # f2 = f - f1
+ mulpd %xmm8,%xmm4
+ addpd %xmm4,%xmm9
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm0 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addpd %xmm1,%xmm0 #r2
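+# in effect (sketch): z1 (lead table value) and z2 (tail table value plus
+# the polynomial) sum to the natural log of the reduced significand, and
+#   log2(x) = xexp + (z1 + z2)*(log2e_lead + log2e_tail)
+# is assembled as r1 = xexp + z1*log2e_lead plus r2 = the remaining three
+# partial products, so the final sum r1 + r2 keeps the extra precision.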
+
+ divpd %xmm9,%xmm8 # u
+ movapd p_x2(%rsp),%xmm3
+ andpd .L__real_inf(%rip),%xmm3
+ cmppd $0,.L__real_inf(%rip),%xmm3
+ movmskpd %xmm3,%r10d
+ movapd p_x2(%rsp),%xmm6
+ cmppd $2,.L__real_zero(%rip),%xmm6
+ movmskpd %xmm6,%r11d
+
+# check for nans/infs
+ test $3,%r8d
+ addpd %xmm5,%xmm0
+ jnz .L__log_naninf
+.L__vlog1:
+# check for negative numbers or zero
+ test $3,%r9d
+ jnz .L__z_or_n
+
+.L__vlog2:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+ # It seems like a good idea to try and interleave
+ # even more of the following code sooner into the
+ # program. But there were conflicts with the table
+ # index registers, making the problem difficult.
+ # After a lot of work in a branch of this file,
+ # I was not able to match the speed of this version.
+ # CodeAnalyst shows that there is lots of unused add
+ # pipe time around the divides, but the processor
+ # doesn't seem to be able to schedule in those slots.
+
+ movlpd -512(%rdx,%rax,8),%xmm7 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm7 #z2 +=q
+
+# check for near one
+ mov p_n1(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one1
+.L__vlog2n:
+
+ # solve for ln(1+u)
+ movapd %xmm8,%xmm9 # u
+ mulpd %xmm8,%xmm8 # u^2
+ movapd %xmm8,%xmm5
+ movapd .L__real_cb3(%rip),%xmm3
+ mulpd %xmm8,%xmm3 #Cu2
+ mulpd %xmm9,%xmm5 # u^3
+ addpd .L__real_cb2(%rip),%xmm3 #B+Cu2
+
+ mulpd %xmm5,%xmm8 # u^5
+ movapd .L__real_log2e_lead(%rip),%xmm4
+
+ mulpd .L__real_cb1(%rip),%xmm5 #Au3
+ addpd %xmm5,%xmm9 # u+Au3
+ movapd %xmm7,%xmm5 #z1 copy
+ mulpd %xmm3,%xmm8 # u5(B+Cu2)
+ movapd .L__real_log2e_tail(%rip),%xmm3
+ movapd p_xexp2(%rsp),%xmm6 # xexp
+ addpd %xmm8,%xmm9 # poly
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movlpd -512(%rdx,%rax,8),%xmm2 #z2 +=q
+ movhpd -512(%rdx,%rcx,8),%xmm2 #z2 +=q
+ addpd %xmm2,%xmm9 #z2
+ movapd %xmm9,%xmm2 #z2 copy
+
+ mulpd %xmm4,%xmm5 #z1*log2e_lead
+ mulpd %xmm4,%xmm9 #z2*log2e_lead
+ mulpd %xmm3,%xmm2 #z2*log2e_tail
+ mulpd %xmm3,%xmm7 #z1*log2e_tail
+ addpd %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addpd %xmm2,%xmm7 #z1*log2e_tail + z2*log2e_tail
+
+
+ addpd %xmm9,%xmm7 #r2
+
+ # check for nans/infs
+ test $3,%r10d
+ addpd %xmm5,%xmm7
+ jnz .L__log_naninf2
+.L__vlog3:
+# check for negative numbers or zero
+ test $3,%r11d
+ jnz .L__z_or_n2
+
+.L__vlog4:
+ mov p_n12(%rsp),%r9d
+ test $3,%r9d
+ jnz .L__near_one2
+
+.L__vlog4n:
+
+
+#__vda_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm7,-16(%rdi)
+ movhpd %xmm7,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Lboth_nearone:
+# saves 10 cycles
+# r = x - 1.0;
+ movapd .L__real_two(%rip),%xmm2
+ subpd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addpd %xmm0,%xmm2
+ movapd %xmm0,%xmm1
+ divpd %xmm2,%xmm1 # u
+ movapd .L__real_ca4(%rip),%xmm4 #D
+ movapd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movapd %xmm0,%xmm6
+ mulpd %xmm1,%xmm6 # correction
+# u = u + u;
+ addpd %xmm1,%xmm1 #u
+ movapd %xmm1,%xmm2
+ mulpd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulpd %xmm1,%xmm5 # Cu
+ movapd %xmm1,%xmm3
+ mulpd %xmm2,%xmm3 # u^3
+ mulpd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulpd %xmm3,%xmm4 #Du^3
+
+ addpd .L__real_ca1(%rip),%xmm2 # +A
+ movapd %xmm3,%xmm1
+ mulpd %xmm1,%xmm1 # u^6
+ addpd %xmm4,%xmm5 #Cu+Du3
+
+ mulpd %xmm3,%xmm2 #u3(A+Bu2)
+ mulpd %xmm5,%xmm1 #u6(Cu+Du3)
+ addpd %xmm1,%xmm2
+ subpd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movapd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subpd %xmm3,%xmm0
+ addpd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movapd %xmm3,%xmm0
+ movapd %xmm2,%xmm1
+
+ mulpd .L__real_log2e_tail(%rip),%xmm2
+ mulpd .L__real_log2e_tail(%rip),%xmm0
+ mulpd .L__real_log2e_lead(%rip),%xmm1
+ mulpd .L__real_log2e_lead(%rip),%xmm3
+ addpd %xmm2,%xmm0
+ addpd %xmm1,%xmm0
+ addpd %xmm3,%xmm0
+# return r + r2;
+# addpd %xmm2,%xmm0
+ ret
+
+ .align 16
+.L__near_one1:
+ cmp $3,%r9d
+ jnz .L__n1nb1
+
+ movapd p_x(%rsp),%xmm0
+ call .Lboth_nearone
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+ jmp .L__vlog2n
+
+ .align 16
+.L__n1nb1:
+ test $1,%r9d
+ jz .L__lnn12
+
+ movlpd p_x(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,(%rdi)
+
+.L__lnn12:
+ test $2,%r9d # second number?
+ jz .L__lnn1e
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__ln1
+ movlpd %xmm0,8(%rdi)
+
+.L__lnn1e:
+ jmp .L__vlog2n
+
+
+ .align 16
+.L__near_one2:
+ cmp $3,%r9d
+ jnz .L__n1nb2
+
+ movapd p_x2(%rsp),%xmm0
+ call .Lboth_nearone
+ movapd %xmm0,%xmm7
+ jmp .L__vlog4n
+
+ .align 16
+.L__n1nb2:
+ test $1,%r9d
+ jz .L__lnn22
+
+ movlpd p_x2(%rsp),%xmm0
+ call .L__ln1
+ movsd %xmm0,%xmm7
+
+.L__lnn22:
+ test $2,%r9d # second number?
+ jz .L__lnn2e
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__ln1
+ movlhps %xmm0,%xmm7
+
+.L__lnn2e:
+ jmp .L__vlog4n
+
+ .align 16
+
+.L__ln1:
+# saves 10 cycles
+# r = x - 1.0;
+ movlpd .L__real_two(%rip),%xmm2
+ subsd .L__real_one(%rip),%xmm0 # r
+# u = r / (2.0 + r);
+ addsd %xmm0,%xmm2
+ movsd %xmm0,%xmm1
+ divsd %xmm2,%xmm1 # u
+ movlpd .L__real_ca4(%rip),%xmm4 #D
+ movlpd .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movsd %xmm0,%xmm6
+ mulsd %xmm1,%xmm6 # correction
+# u = u + u;
+ addsd %xmm1,%xmm1 #u
+ movsd %xmm1,%xmm2
+ mulsd %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulsd %xmm1,%xmm5 # Cu
+ movsd %xmm1,%xmm3
+ mulsd %xmm2,%xmm3 # u^3
+ mulsd .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulsd %xmm3,%xmm4 #Du^3
+
+ addsd .L__real_ca1(%rip),%xmm2 # +A
+ movsd %xmm3,%xmm1
+ mulsd %xmm1,%xmm1 # u^6
+ addsd %xmm4,%xmm5 #Cu+Du3
+
+ mulsd %xmm3,%xmm2 #u3(A+Bu2)
+ mulsd %xmm5,%xmm1 #u6(Cu+Du3)
+ addsd %xmm1,%xmm2
+ subsd %xmm6,%xmm2 # -correction
+
+# loge to log2
+ movsd %xmm0,%xmm3 #r1 = r
+ pand .L__mask_lower(%rip),%xmm3
+ subsd %xmm3,%xmm0
+ addsd %xmm0,%xmm2 #r2 = r2 + (r - r1);
+
+ movsd %xmm3,%xmm0
+ movsd %xmm2,%xmm1
+
+ mulsd .L__real_log2e_tail(%rip),%xmm2
+ mulsd .L__real_log2e_tail(%rip),%xmm0
+ mulsd .L__real_log2e_lead(%rip),%xmm1
+ mulsd .L__real_log2e_lead(%rip),%xmm3
+ addsd %xmm2,%xmm0
+ addsd %xmm1,%xmm0
+ addsd %xmm3,%xmm0
+
+# return r + r2;
+# addsd %xmm2,%xmm0
+ ret
+
+ .align 16
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf:
+ test $1,%r8d # first number?
+ jz .L__lninf2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rdx
+ movlpd p_x(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninf2:
+ test $2,%r8d # second number?
+ jz .L__lninfe
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rdx
+ movlpd p_x+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe:
+ jmp .L__vlog1 # continue processing if not
+
+# at least one of the numbers was a nan or infinity
+.L__log_naninf2:
+ test $1,%r10d # first number?
+ jz .L__lninf22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm7,%xmm1 # save the inputs
+ mov p_x2(%rsp),%rdx
+ movlpd p_x2(%rsp),%xmm0
+ call .L__lni
+ shufpd $2,%xmm7,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+ movapd %xmm0,%xmm7
+
+.L__lninf22:
+ test $2,%r10d # second number?
+ jz .L__lninfe2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rdx
+ movlpd p_x2+8(%rsp),%xmm0
+ call .L__lni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__lninfe2:
+ jmp .L__vlog3 # continue processing if not
+
+# a subroutine to treat one number for nan/infinity
+# the number is expected in rdx and returned in the low
+# half of xmm0
+.L__lni:
+ mov $0x0000FFFFFFFFFFFFF,%rax
+ test %rax,%rdx
+ jnz .L__lnan # jump if mantissa not zero, so it's a NaN
+# inf
+ rcl $1,%rdx
+ jnc .L__lne2 # log(+inf) = inf
+# negative x
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+#NaN
+.L__lnan:
+ mov $0x00008000000000000,%rax # convert to quiet
+ or %rax,%rdx
+.L__lne:
+ movd %rdx,%xmm0
+.L__lne2:
+ ret
+
+ .align 16
+
+# at least one of the numbers was a zero, a negative number, or both.
+.L__z_or_n:
+ test $1,%r9d # first number?
+ jz .L__zn2
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn2:
+ test $2,%r9d # second number?
+ jz .L__zne
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ movapd %xmm0,%xmm1 # save the inputs
+ mov p_x+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm1
+ movapd %xmm1,%xmm0
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne:
+ jmp .L__vlog2
+
+.L__z_or_n2:
+ test $1,%r11d # first number?
+ jz .L__zn22
+
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2(%rsp),%rax
+ call .L__zni
+ shufpd $2,%xmm7,%xmm0
+ movapd %xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zn22:
+ test $2,%r11d # second number?
+ jz .L__zne2
+ mov %rax,p2_temp(%rsp)
+ mov %rdx,p2_temp+8(%rsp)
+ mov p_x2+8(%rsp),%rax
+ call .L__zni
+ shufpd $0,%xmm0,%xmm7
+ mov p2_temp(%rsp),%rax
+ mov p2_temp+8(%rsp),%rdx
+
+.L__zne2:
+ jmp .L__vlog4
+# a subroutine to treat one number for zero or negative values
+# the number is expected in rax and returned in the low
+# half of xmm0
+.L__zni:
+ shl $1,%rax
+	jnz		.L__zn_x		# nonzero after shifting out the sign bit => x is negative, not +-0
+ movlpd .L__real_ninf(%rip),%xmm0 # C99 specs -inf for +-0
+ ret
+.L__zn_x:
+ movlpd .L__real_nan(%rip),%xmm0
+ ret
+
+
+# we jump here when there are leftover values (fewer than a full group of
+# four) to process at the end.  save_xa points at the next x array element,
+# save_ya at the next y array element, and the number of values left is in
+# save_nv
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rdi # parameter for N
+ lea p_x(%rsp),%rsi # &x parameter
+ lea p2_temp(%rsp),%rdx # &y parameter
+ call vrda_log2@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_two: .quad 0x04000000000000000 # 2.0
+ .quad 0x04000000000000000
+.L__real_ninf: .quad 0x0fff0000000000000 # -inf
+ .quad 0x0fff0000000000000
+.L__real_inf: .quad 0x07ff0000000000000 # +inf
+ .quad 0x07ff0000000000000
+.L__real_nan: .quad 0x07ff8000000000000 # NaN
+ .quad 0x07ff8000000000000
+
+.L__real_zero: .quad 0x00000000000000000 # 0.0
+ .quad 0x00000000000000000
+
+.L__real_sign: .quad 0x08000000000000000 # sign bit
+ .quad 0x08000000000000000
+.L__real_notsign: .quad 0x07ffFFFFFFFFFFFFF # ^sign bit
+ .quad 0x07ffFFFFFFFFFFFFF
+.L__real_threshold: .quad 0x03F9EB85000000000 # .03
+ .quad 0x03F9EB85000000000
+.L__real_qnanbit: .quad 0x00008000000000000 # quiet nan bit
+ .quad 0x00008000000000000
+.L__real_mant: .quad 0x0000FFFFFFFFFFFFF # mantissa bits
+ .quad 0x0000FFFFFFFFFFFFF
+.L__real_3f80000000000000: .quad 0x03f80000000000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03f80000000000000
+.L__mask_1023: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+.L__mask_040: .quad 0x00000000000000040 #
+ .quad 0x00000000000000040
+.L__mask_001: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001
+
+.L__real_ca1: .quad 0x03fb55555555554e6 # 8.33333333333317923934e-02
+ .quad 0x03fb55555555554e6
+.L__real_ca2: .quad 0x03f89999999bac6d4 # 1.25000000037717509602e-02
+ .quad 0x03f89999999bac6d4
+.L__real_ca3: .quad 0x03f62492307f1519f # 2.23213998791944806202e-03
+ .quad 0x03f62492307f1519f
+.L__real_ca4: .quad 0x03f3c8034c85dfff0 # 4.34887777707614552256e-04
+ .quad 0x03f3c8034c85dfff0
+
+.L__real_cb1: .quad 0x03fb5555555555557 # 8.33333333333333593622e-02
+ .quad 0x03fb5555555555557
+.L__real_cb2: .quad 0x03f89999999865ede # 1.24999999978138668903e-02
+ .quad 0x03f89999999865ede
+.L__real_cb3: .quad 0x03f6249423bd94741 # 2.23219810758559851206e-03
+ .quad 0x03f6249423bd94741
+.L__real_log2_lead: .quad 0x03fe62e42e0000000 # log2_lead 6.93147122859954833984e-01
+ .quad 0x03fe62e42e0000000
+.L__real_log2_tail: .quad 0x03e6efa39ef35793c # log2_tail 5.76999904754328540596e-08
+ .quad 0x03e6efa39ef35793c
+
+.L__real_half: .quad 0x03fe0000000000000 # 1/2
+ .quad 0x03fe0000000000000
+.L__real_log2e_lead: .quad 0x03FF7154400000000 # log2e_lead 1.44269180297851562500E+00
+ .quad 0x03FF7154400000000
+.L__real_log2e_tail : .quad 0x03ECB295C17F0BBBE # log2e_tail 3.23791044778235969970E-06
+ .quad 0x03ECB295C17F0BBBE
+.L__mask_lower: .quad 0x0ffffffff00000000
+ .quad 0x0ffffffff00000000
+
+.L__np_ln_lead_table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3f8fc0a800000000 # 1.55041813850402832031e-02
+ .quad 0x3f9f829800000000 # 3.07716131210327148438e-02
+ .quad 0x3fa7745800000000 # 4.58095073699951171875e-02
+ .quad 0x3faf0a3000000000 # 6.06245994567871093750e-02
+ .quad 0x3fb341d700000000 # 7.52233862876892089844e-02
+ .quad 0x3fb6f0d200000000 # 8.96121263504028320312e-02
+ .quad 0x3fba926d00000000 # 1.03796780109405517578e-01
+ .quad 0x3fbe270700000000 # 1.17783010005950927734e-01
+ .quad 0x3fc0d77e00000000 # 1.31576299667358398438e-01
+ .quad 0x3fc2955280000000 # 1.45181953907012939453e-01
+ .quad 0x3fc44d2b00000000 # 1.58604979515075683594e-01
+ .quad 0x3fc5ff3000000000 # 1.71850204467773437500e-01
+ .quad 0x3fc7ab8900000000 # 1.84922337532043457031e-01
+ .quad 0x3fc9525a80000000 # 1.97825729846954345703e-01
+ .quad 0x3fcaf3c900000000 # 2.10564732551574707031e-01
+ .quad 0x3fcc8ff780000000 # 2.23143517971038818359e-01
+ .quad 0x3fce270700000000 # 2.35566020011901855469e-01
+ .quad 0x3fcfb91800000000 # 2.47836112976074218750e-01
+ .quad 0x3fd0a324c0000000 # 2.59957492351531982422e-01
+ .quad 0x3fd1675c80000000 # 2.71933674812316894531e-01
+ .quad 0x3fd22941c0000000 # 2.83768117427825927734e-01
+ .quad 0x3fd2e8e280000000 # 2.95464158058166503906e-01
+ .quad 0x3fd3a64c40000000 # 3.07025015354156494141e-01
+ .quad 0x3fd4618bc0000000 # 3.18453729152679443359e-01
+ .quad 0x3fd51aad80000000 # 3.29753279685974121094e-01
+ .quad 0x3fd5d1bd80000000 # 3.40926527976989746094e-01
+ .quad 0x3fd686c800000000 # 3.51976394653320312500e-01
+ .quad 0x3fd739d7c0000000 # 3.62905442714691162109e-01
+ .quad 0x3fd7eaf800000000 # 3.73716354370117187500e-01
+ .quad 0x3fd89a3380000000 # 3.84411692619323730469e-01
+ .quad 0x3fd9479400000000 # 3.94993782043457031250e-01
+ .quad 0x3fd9f323c0000000 # 4.05465066432952880859e-01
+ .quad 0x3fda9cec80000000 # 4.15827870368957519531e-01
+ .quad 0x3fdb44f740000000 # 4.26084339618682861328e-01
+ .quad 0x3fdbeb4d80000000 # 4.36236739158630371094e-01
+ .quad 0x3fdc8ff7c0000000 # 4.46287095546722412109e-01
+ .quad 0x3fdd32fe40000000 # 4.56237375736236572266e-01
+ .quad 0x3fddd46a00000000 # 4.66089725494384765625e-01
+ .quad 0x3fde744240000000 # 4.75845873355865478516e-01
+ .quad 0x3fdf128f40000000 # 4.85507786273956298828e-01
+ .quad 0x3fdfaf5880000000 # 4.95077252388000488281e-01
+ .quad 0x3fe02552a0000000 # 5.04556000232696533203e-01
+ .quad 0x3fe0723e40000000 # 5.13945698738098144531e-01
+ .quad 0x3fe0be72e0000000 # 5.23248136043548583984e-01
+ .quad 0x3fe109f380000000 # 5.32464742660522460938e-01
+ .quad 0x3fe154c3c0000000 # 5.41597247123718261719e-01
+ .quad 0x3fe19ee6a0000000 # 5.50647079944610595703e-01
+ .quad 0x3fe1e85f40000000 # 5.59615731239318847656e-01
+ .quad 0x3fe23130c0000000 # 5.68504691123962402344e-01
+ .quad 0x3fe2795e00000000 # 5.77315330505371093750e-01
+ .quad 0x3fe2c0e9e0000000 # 5.86049020290374755859e-01
+ .quad 0x3fe307d720000000 # 5.94707071781158447266e-01
+ .quad 0x3fe34e2880000000 # 6.03290796279907226562e-01
+ .quad 0x3fe393e0c0000000 # 6.11801505088806152344e-01
+ .quad 0x3fe3d90260000000 # 6.20240390300750732422e-01
+ .quad 0x3fe41d8fe0000000 # 6.28608644008636474609e-01
+ .quad 0x3fe4618bc0000000 # 6.36907458305358886719e-01
+ .quad 0x3fe4a4f840000000 # 6.45137906074523925781e-01
+ .quad 0x3fe4e7d800000000 # 6.53301239013671875000e-01
+ .quad 0x3fe52a2d20000000 # 6.61398470401763916016e-01
+ .quad 0x3fe56bf9c0000000 # 6.69430613517761230469e-01
+ .quad 0x3fe5ad4040000000 # 6.77398800849914550781e-01
+ .quad 0x3fe5ee02a0000000 # 6.85303986072540283203e-01
+ .quad 0x3fe62e42e0000000 # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_tail_table:
+ .quad 0x00000000000000000 # 0 ; 0.00000000000000000000e+00
+ .quad 0x03e361f807c79f3db # 5.15092497094772879206e-09
+ .quad 0x03e6873c1980267c8 # 4.55457209735272790188e-08
+ .quad 0x03e5ec65b9f88c69e # 2.86612990859791781788e-08
+ .quad 0x03e58022c54cc2f99 # 2.23596477332056055352e-08
+ .quad 0x03e62c37a3a125330 # 3.49498983167142274770e-08
+ .quad 0x03e615cad69737c93 # 3.23392843005887000414e-08
+ .quad 0x03e4d256ab1b285e9 # 1.35722380472479366661e-08
+ .quad 0x03e5b8abcb97a7aa2 # 2.56504325268044191098e-08
+ .quad 0x03e6f34239659a5dc # 5.81213608741512136843e-08
+ .quad 0x03e6e07fd48d30177 # 5.59374849578288093334e-08
+ .quad 0x03e6b32df4799f4f6 # 5.06615629004996189970e-08
+ .quad 0x03e6c29e4f4f21cf8 # 5.24588857848400955725e-08
+ .quad 0x03e1086c848df1b59 # 9.61968535632653505972e-10
+ .quad 0x03e4cf456b4764130 # 1.34829655346594463137e-08
+ .quad 0x03e63a02ffcb63398 # 3.65557749306383026498e-08
+ .quad 0x03e61e6a6886b0976 # 3.33431709374069198903e-08
+ .quad 0x03e6b8abcb97a7aa2 # 5.13008650536088382197e-08
+ .quad 0x03e6b578f8aa35552 # 5.09285070380306053751e-08
+ .quad 0x03e6139c871afb9fc # 3.20853940845502057341e-08
+ .quad 0x03e65d5d30701ce64 # 4.06713248643004200446e-08
+ .quad 0x03e6de7bcb2d12142 # 5.57028186706125221168e-08
+ .quad 0x03e6d708e984e1664 # 5.48356693724804282546e-08
+ .quad 0x03e556945e9c72f36 # 1.99407553679345001938e-08
+ .quad 0x03e20e2f613e85bda # 1.96585517245087232086e-09
+ .quad 0x03e3cb7e0b42724f6 # 6.68649386072067321503e-09
+ .quad 0x03e6fac04e52846c7 # 5.89936034642113390002e-08
+ .quad 0x03e5e9b14aec442be # 2.85038578721554472484e-08
+ .quad 0x03e6b5de8034e7126 # 5.09746772910284482606e-08
+ .quad 0x03e6dc157e1b259d3 # 5.54234668933210171467e-08
+ .quad 0x03e3b05096ad69c62 # 6.29100830926604004874e-09
+ .quad 0x03e5c2116faba4cdd # 2.61974119468563937716e-08
+ .quad 0x03e665fcc25f95b47 # 4.16752115011186398935e-08
+ .quad 0x03e5a9a08498d4850 # 2.47747534460820790327e-08
+ .quad 0x03e6de647b1465f77 # 5.56922172017964209793e-08
+ .quad 0x03e5da71b7bf7861d # 2.76162876992552906035e-08
+ .quad 0x03e3e6a6886b09760 # 7.08169709942321478061e-09
+ .quad 0x03e6f0075eab0ef64 # 5.77453510221151779025e-08
+ .quad 0x03e33071282fb989b # 4.43021445893361960146e-09
+ .quad 0x03e60eb43c3f1bed2 # 3.15140984357495864573e-08
+ .quad 0x03e5faf06ecb35c84 # 2.95077445089736670973e-08
+ .quad 0x03e4ef1e63db35f68 # 1.44098510263167149349e-08
+ .quad 0x03e469743fb1a71a5 # 1.05196987538551827693e-08
+ .quad 0x03e6c1cdf404e5796 # 5.23641361722697546261e-08
+ .quad 0x03e4094aa0ada625e # 7.72099925253243069458e-09
+ .quad 0x03e6e2d4c96fde3ec # 5.62089493829364197156e-08
+ .quad 0x03e62f4d5e9a98f34 # 3.53090261098577946927e-08
+ .quad 0x03e6467c96ecc5cbe # 3.80080516835568242269e-08
+ .quad 0x03e6e7040d03dec5a # 5.66961038386146408282e-08
+ .quad 0x03e67bebf4282de36 # 4.42287063097349852717e-08
+ .quad 0x03e6289b11aeb783f # 3.45294525105681104660e-08
+ .quad 0x03e5a891d1772f538 # 2.47132034530447431509e-08
+ .quad 0x03e634f10be1fb591 # 3.59655343422487209774e-08
+ .quad 0x03e6d9ce1d316eb93 # 5.51581770357780862071e-08
+ .quad 0x03e63562a19a9c442 # 3.60171867511861372793e-08
+ .quad 0x03e54e2adf548084c # 1.94511067964296180547e-08
+ .quad 0x03e508ce55cc8c97a # 1.54137376631349347838e-08
+ .quad 0x03e30e2f613e85bda # 3.93171034490174464173e-09
+ .quad 0x03e6db03ebb0227bf # 5.52990607758839766440e-08
+ .quad 0x03e61b75bb09cb098 # 3.29990737637586136511e-08
+ .quad 0x03e496f16abb9df22 # 1.18436010922446096216e-08
+ .quad 0x03e65b3f399411c62 # 4.04248680368301346709e-08
+ .quad 0x03e586b3e59f65355 # 2.27418915900284316293e-08
+ .quad 0x03e52482ceae1ac12 # 1.70263791333409206020e-08
+ .quad 0x03e6efa39ef35793c # 5.76999904754328540596e-08
+ .quad 0 # for alignment
+
diff --git a/src/gas/vrdalogr.S b/src/gas/vrdalogr.S
new file mode 100644
index 0000000..4064fb3
--- /dev/null
+++ b/src/gas/vrdalogr.S
@@ -0,0 +1,2428 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdalogr.asm
+#
+# An array implementation of the log libm function.
+#
+# Prototype:
+#
+# void vrda_logr(int n, double *x, double *y);
+#
+# Computes the natural log of x.
+# A reduced-precision routine. Uses the novel Intel reduction technique
+# with frcpa. Also uses only 3 polynomial terms to achieve 52-18 = 34 significant bits.
+#
+# This specialized routine does not handle negative numbers, 0, NaNs, or infinity.
+# This routine is not C99 compliant.
+# This version can compute logs in 26
+# cycles with n <= 24
+#
+#
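+# An illustrative scalar C equivalent of what this routine computes per
+# element (a sketch only; the code below is vectorized, reduced precision,
+# and skips all special-case handling):
+#
+#   #include <math.h>
+#   void vrda_logr(int n, double *x, double *y)
+#   {
+#       int i;
+#       for (i = 0; i < n; i++)
+#           y[i] = log(x[i]);   /* natural log; x[i] assumed finite and > 0 */
+#   }
+#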
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_x,0 # temporary for error checking operation
+
+.equ p_x2,0x030 # temporary for error checking operation
+
+.equ save_xa,0x060 #qword
+.equ save_ya,0x068 #qword
+.equ save_nv,0x070 #qword
+.equ p_iter,0x078 # qword storage for number of loop iterations
+
+.equ save_rbx,0x080 #qword
+.equ save_rdi,0x088 #qword
+
+.equ save_rsi,0x090 #qword
+
+
+
+.equ p2_temp,0x0e0 # second temporary for get/put bits operation
+.equ p2_temp1,0x0f0 # second temporary for exponent multiply
+
+
+
+.equ stack_size,0x0118
+
+ .weak vrda_logr_
+ .set vrda_logr_,__vrda_logr__
+ .weak vrda_logr__
+ .set vrda_logr__,__vrda_logr__
+
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log
+#** VRDA_LOGR(N,X,Y)
+#** C equivalent
+#*/
+#void vrda_logr_(int * n, double *x, double *y)
+#{
+# vrda_logr(*n,x,y);
+#}
+.globl __vrda_logr__
+ .type __vrda_logr__,@function
+__vrda_logr__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrda_logr
+ .type vrda_logr,@function
+vrda_logr:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vda_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
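+# (equivalently, in C: iter = n >> 2; leftover = n - (iter << 2), i.e. n % 4;
+#  the main loop below handles groups of 4, the tail handles the leftovers)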
+
+# In this second version, process the array 4 values at a time (two packed
+# doubles in each of two xmm registers).
+
+.L__vda_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi),%xmm1
+ movhpd -8(%rsi),%xmm1
+# compute the logs
+
+## if NaN or inf
+ movdqa %xmm0,p_x(%rsp) # save the input values
+
+# use the algorithm referenced in the Itanium transcendental paper.
+
+# reduction
+# compute r = x*frcpa(x) - 1
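+# (frcpa(x) here is an approximation to 1/x, supplied by __vrd4_frcpa.
+#  Since frcpa(x) ~= 1/x, r = x*frcpa(x) - 1 is small, and
+#  ln(x) = ln(1+r) + ln(1/frcpa(x)); ln(1+r) is approximated by the short
+#  polynomial below, while ln(1/frcpa(x)) is reconstructed from N*ln(2)
+#  plus a table lookup at the reconstruction step.)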
+ movdqa %xmm0,%xmm8
+ movdqa %xmm1,%xmm9
+
+ call __vrd4_frcpa@PLT
+ movdqa %xmm8,%xmm4
+ movdqa %xmm9,%xmm7
+# invert the exponent
+ psllq $1,%xmm8
+ psllq $1,%xmm9
+ mulpd %xmm0,%xmm4 # r
+ mulpd %xmm1,%xmm7 # r
+ movdqa %xmm8,%xmm5
+ paddq .L__mask_rup(%rip),%xmm8
+ psrlq $53,%xmm8
+ movdqa %xmm9,%xmm6
+ paddq .L__mask_rup(%rip),%xmm6
+ psrlq $53,%xmm6
+ psubq .L__mask_3ff(%rip),%xmm8
+ psubq .L__mask_3ff(%rip),%xmm6
+ pshufd $0x058,%xmm8,%xmm8
+ pshufd $0x058,%xmm6,%xmm6
+
+
+ subpd .L__real_one(%rip),%xmm4
+ subpd .L__real_one(%rip),%xmm7
+
+ cvtdq2pd %xmm8,%xmm0 #N
+ cvtdq2pd %xmm6,%xmm1 #N
+
+# compute index for table lookup. if 1/2 bit set, increment the index+exponent
+ psrlq $42,%xmm5
+ psrlq $42,%xmm9
+ paddq .L__int_one(%rip),%xmm5
+ paddq .L__int_one(%rip),%xmm9
+ psrlq $1,%xmm5
+ psrlq $1,%xmm9
+ pand .L__mask_3ff(%rip),%xmm5
+ pand .L__mask_3ff(%rip),%xmm9
+ psllq $1,%xmm5
+ psllq $1,%xmm9
+
+ movdqa %xmm5,p_x(%rsp) # move the indexes to a memory location
+ movdqa %xmm9,p_x2(%rsp)
+
+
+ movapd .L__real_third(%rip),%xmm3
+ movdqa %xmm3,%xmm5
+ movapd %xmm4,%xmm2
+ movapd %xmm7,%xmm8
+
+# approximation
+# compute the polynomial
+# p(r) = p1r^2+p2r^3+p3r^4+p4r^5
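+# (only the first two terms are evaluated here, using the coefficients
+#  -1/2 and +1/3 from .L__real_half and .L__real_third; the remaining
+#  terms are dropped, as noted in the comment below)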
+
+ mulpd %xmm4,%xmm2 #r^2
+ mulpd %xmm7,%xmm8 #r^2
+
+# eliminating the 4th and 5th terms gets us to 8000 ulps, or 53-16 = 37 significant bits
+# The routine runs in 60 cycles.
+ mulpd %xmm4,%xmm3 # 1/3r
+ mulpd %xmm7,%xmm5 # 1/3r
+# lookup the f(k) term
+ lea .L__np_lnf_table(%rip),%rdx
+ mov p_x(%rsp),%rcx
+ mov p_x+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm6 # lookup
+ movhpd (%rdx,%r9,8),%xmm6 # lookup
+
+ addpd .L__real_half(%rip),%xmm3 # p2 + p3r
+ addpd .L__real_half(%rip),%xmm5 # p2 + p3r
+
+ mov p_x2(%rsp),%rcx
+ mov p_x2+8(%rsp),%r9
+ movlpd (%rdx,%rcx,8),%xmm9 # lookup
+ movhpd (%rdx,%r9,8),%xmm9 # lookup
+
+ mulpd %xmm3,%xmm2 # r2(p2 + p3r)
+ mulpd %xmm5,%xmm8 # r2(p2 + p3r)
+ addpd %xmm4,%xmm2 # +r
+ addpd %xmm7,%xmm8 # +r
+
+
+# reconstruction
+# compute ln(x) = T + r + p(r) where
+# T = N*ln(2)+ln(1/frcpa(x)) via tab of ln(1/frcpa(y)), where y = 1 + k/256, 0<=k<=255
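+# (so the final result per element is y = N*ln(2) + f(k) + (r + p(r)),
+#  with f(k) read from .L__np_lnf_table at the index computed above)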
+
+ mulpd .L__real_log2(%rip),%xmm0 # compute N*__real_log2
+ mulpd .L__real_log2(%rip),%xmm1 # compute N*__real_log2
+ addpd %xmm6,%xmm2 # add the new mantissas
+ addpd %xmm9,%xmm8 # add the new mantissas
+ addpd %xmm2,%xmm0
+ addpd %xmm8,%xmm1
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm1,-16(%rdi)
+ movhpd %xmm1,-8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vda_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vda_cleanup
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# we jump here when we have 1 to 3 leftover log calls to make at the
+# end.
+# The next x and y array elements are found via save_xa and save_ya,
+# and the number of values left is in
+# save_nv
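+# (in effect: copy the 1-3 leftover inputs into a zero-padded 4-element
+#  buffer at p_x, call vrda_logr once more for a block of 4, then copy
+#  back only the valid results)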
+.L__vda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__finish # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_x+8(%rsp)
+ movapd %xmm0,p_x+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_x(%rsp)
+ cmp $2,%rax
+ jl .L__vdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_x+8(%rsp)
+ cmp $3,%rax
+ jl .L__vdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_x+16(%rsp)
+
+.L__vdacg:
+ mov $4,%rcx # parameter for N
+ lea p_x(%rsp),%rdx # &x parameter
+ lea p2_temp(%rsp),%r8 # &y parameter
+ call vrda_logr@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vdacgf
+
+ mov p2_temp+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+
+.L__vdacgf:
+ jmp .L__finish
+
+ .data
+ .align 64
+
+.L__real_one: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+
+.L__real_half: .quad 0x0bfe0000000000000 # -1/2
+ .quad 0x0bfe0000000000000
+.L__real_third: .quad 0x03fd5555555555555 # 1/3
+ .quad 0x03fd5555555555555
+.L__real_fourth: .quad 0x0bfd0000000000000 # -1/4
+ .quad 0x0bfd0000000000000
+.L__real_fifth: .quad 0x03fc999999999999a # 1/5
+ .quad 0x03fc999999999999a
+.L__real_sixth: .quad 0x0bfc5555555555555 # -1/6
+ .quad 0x0bfc5555555555555
+
+.L__real_log2: .quad 0x03FE62E42FEFA39EF # ln(2) = 0.6931471805599453
+ .quad 0x03FE62E42FEFA39EF
+
+.L__mask_3ff: .quad 0x000000000000003ff #
+ .quad 0x000000000000003ff
+
+.L__mask_rup: .quad 0x0000003fffffffffe
+ .quad 0x0000003fffffffffe
+
+.L__int_one: .quad 0x00000000000000001
+ .quad 0x00000000000000001
+
+
+
+.L__mask_10bits: .quad 0x000000000000003ff
+ .quad 0x000000000000003ff
+
+.L__mask_expext: .quad 0x000000000003ff000
+ .quad 0x000000000003ff000
+
+.L__mask_expext2: .quad 0x000000000003ff800
+ .quad 0x000000000003ff800
+
+
+
+
+.L__np_lnf_table:
+#log table Program - logtab.c
+#Built Jan 18 2006 09:51:57
+#Compiler version 1400
+
+ .quad 0x00000000000000000 # 0.000000000000 0
+ .quad 0x00000000000000000
+ .quad 0x03F50020055655885 # 0.000977039648 1
+ .quad 0x03F50020055655885
+ .quad 0x03F60040155D5881E # 0.001955034836 2
+ .quad 0x03F60040155D5881E
+ .quad 0x03F6809048289860A # 0.002933987435 3
+ .quad 0x03F6809048289860A
+ .quad 0x03F70080559588B25 # 0.003913899321 4
+ .quad 0x03F70080559588B25
+ .quad 0x03F740C8A7478788D # 0.004894772377 5
+ .quad 0x03F740C8A7478788D
+ .quad 0x03F78121214586B02 # 0.005876608489 6
+ .quad 0x03F78121214586B02
+ .quad 0x03F7C189CBB0E283F # 0.006859409551 7
+ .quad 0x03F7C189CBB0E283F
+ .quad 0x03F8010157588DE69 # 0.007843177461 8
+ .quad 0x03F8010157588DE69
+ .quad 0x03F82145E939EF1BC # 0.008827914124 9
+ .quad 0x03F82145E939EF1BC
+ .quad 0x03F83D8896A83D7A8 # 0.009690354884 10
+ .quad 0x03F83D8896A83D7A8
+ .quad 0x03F85DDC705054DFF # 0.010676913110 11
+ .quad 0x03F85DDC705054DFF
+ .quad 0x03F87E38762CA0C6D # 0.011664445593 12
+ .quad 0x03F87E38762CA0C6D
+ .quad 0x03F89E9CAC6007563 # 0.012652954261 13
+ .quad 0x03F89E9CAC6007563
+ .quad 0x03F8BF091710935A4 # 0.013642441046 14
+ .quad 0x03F8BF091710935A4
+ .quad 0x03F8DF7DBA6777895 # 0.014632907884 15
+ .quad 0x03F8DF7DBA6777895
+ .quad 0x03F8FBEA8B13C03F9 # 0.015500371846 16
+ .quad 0x03F8FBEA8B13C03F9
+ .quad 0x03F90E3751F24F45C # 0.016492681528 17
+ .quad 0x03F90E3751F24F45C
+ .quad 0x03F91E7D80B1FBF4C # 0.017485976867 18
+ .quad 0x03F91E7D80B1FBF4C
+ .quad 0x03F92CBE4F6CC56C3 # 0.018355920375 19
+ .quad 0x03F92CBE4F6CC56C3
+ .quad 0x03F93D0C443D7258C # 0.019351069108 20
+ .quad 0x03F93D0C443D7258C
+ .quad 0x03F94D5E6176ACC89 # 0.020347209148 21
+ .quad 0x03F94D5E6176ACC89
+ .quad 0x03F95DB4A937DEF10 # 0.021344342472 22
+ .quad 0x03F95DB4A937DEF10
+ .quad 0x03F96C039490E37F4 # 0.022217650494 23
+ .quad 0x03F96C039490E37F4
+ .quad 0x03F97C61B1CF5DED7 # 0.023216651576 24
+ .quad 0x03F97C61B1CF5DED7
+ .quad 0x03F98AB77B3FD6EAD # 0.024091596947 25
+ .quad 0x03F98AB77B3FD6EAD
+ .quad 0x03F99B1D75828E780 # 0.025092472797 26
+ .quad 0x03F99B1D75828E780
+ .quad 0x03F9AB87A478CB7CB # 0.026094351403 27
+ .quad 0x03F9AB87A478CB7CB
+ .quad 0x03F9B9E8027E1916F # 0.026971819338 28
+ .quad 0x03F9B9E8027E1916F
+ .quad 0x03F9CA5A1A18613E6 # 0.027975583538 29
+ .quad 0x03F9CA5A1A18613E6
+ .quad 0x03F9D8C1670325921 # 0.028854704473 30
+ .quad 0x03F9D8C1670325921
+ .quad 0x03F9E93B6EE41F674 # 0.029860361378 31
+ .quad 0x03F9E93B6EE41F674
+ .quad 0x03F9F7A9B16782855 # 0.030741141554 32
+ .quad 0x03F9F7A9B16782855
+ .quad 0x03FA0415D89E74440 # 0.031748698315 33
+ .quad 0x03FA0415D89E74440
+ .quad 0x03FA0C58FA19DFAAB # 0.032757271269 34
+ .quad 0x03FA0C58FA19DFAAB
+ .quad 0x03FA139577CC41C1A # 0.033640607815 35
+ .quad 0x03FA139577CC41C1A
+ .quad 0x03FA1AD398C6CD57C # 0.034524725334 36
+ .quad 0x03FA1AD398C6CD57C
+ .quad 0x03FA231C9C40E204E # 0.035536103423 37
+ .quad 0x03FA231C9C40E204E
+ .quad 0x03FA2A5E4231CF7BD # 0.036421899115 38
+ .quad 0x03FA2A5E4231CF7BD
+ .quad 0x03FA32AB4D4C59CB0 # 0.037435198758 39
+ .quad 0x03FA32AB4D4C59CB0
+ .quad 0x03FA39F07BA0EBD5A # 0.038322679007 40
+ .quad 0x03FA39F07BA0EBD5A
+ .quad 0x03FA424192495D571 # 0.039337907520 41
+ .quad 0x03FA424192495D571
+ .quad 0x03FA498A4C73DA65D # 0.040227078744 42
+ .quad 0x03FA498A4C73DA65D
+ .quad 0x03FA50D4AF75CA86F # 0.041117041297 43
+ .quad 0x03FA50D4AF75CA86F
+ .quad 0x03FA592BBC15215BC # 0.042135112141 44
+ .quad 0x03FA592BBC15215BC
+ .quad 0x03FA6079B00423FF6 # 0.043026775152 45
+ .quad 0x03FA6079B00423FF6
+ .quad 0x03FA67C94F2D4BB65 # 0.043919233935 46
+ .quad 0x03FA67C94F2D4BB65
+ .quad 0x03FA70265A550E77B # 0.044940163069 47
+ .quad 0x03FA70265A550E77B
+ .quad 0x03FA77798F8D6DFDC # 0.045834331871 48
+ .quad 0x03FA77798F8D6DFDC
+ .quad 0x03FA7ECE7267CD123 # 0.046729300926 49
+ .quad 0x03FA7ECE7267CD123
+ .quad 0x03FA873184BC09586 # 0.047753104446 50
+ .quad 0x03FA873184BC09586
+ .quad 0x03FA8E8A02D2E3175 # 0.048649793163 51
+ .quad 0x03FA8E8A02D2E3175
+ .quad 0x03FA95E430F8CE456 # 0.049547286652 52
+ .quad 0x03FA95E430F8CE456
+ .quad 0x03FA9D400FF482586 # 0.050445586359 53
+ .quad 0x03FA9D400FF482586
+ .quad 0x03FAA5AB21CB34A9E # 0.051473203662 54
+ .quad 0x03FAA5AB21CB34A9E
+ .quad 0x03FAAD0AA2E784EF4 # 0.052373235867 55
+ .quad 0x03FAAD0AA2E784EF4
+ .quad 0x03FAB46BD74DA76A0 # 0.053274078860 56
+ .quad 0x03FAB46BD74DA76A0
+ .quad 0x03FABBCEBFC68F424 # 0.054175734102 57
+ .quad 0x03FABBCEBFC68F424
+ .quad 0x03FAC3335D1BBAE4D # 0.055078203060 58
+ .quad 0x03FAC3335D1BBAE4D
+ .quad 0x03FACBA87200EB8F1 # 0.056110594428 59
+ .quad 0x03FACBA87200EB8F1
+ .quad 0x03FAD310BA20455A2 # 0.057014812019 60
+ .quad 0x03FAD310BA20455A2
+ .quad 0x03FADA7AB998B77ED # 0.057919847959 61
+ .quad 0x03FADA7AB998B77ED
+ .quad 0x03FAE1E6713606CFB # 0.058825703731 62
+ .quad 0x03FAE1E6713606CFB
+ .quad 0x03FAE953E1C48603A # 0.059732380822 63
+ .quad 0x03FAE953E1C48603A
+ .quad 0x03FAF0C30C1116351 # 0.060639880722 64
+ .quad 0x03FAF0C30C1116351
+ .quad 0x03FAF833F0E927711 # 0.061548204926 65
+ .quad 0x03FAF833F0E927711
+ .quad 0x03FAFFA6911AB9309 # 0.062457354934 66
+ .quad 0x03FAFFA6911AB9309
+ .quad 0x03FB038D76BA2D737 # 0.063367332247 67
+ .quad 0x03FB038D76BA2D737
+ .quad 0x03FB0748836296412 # 0.064278138373 68
+ .quad 0x03FB0748836296412
+ .quad 0x03FB0B046EEE6F7A4 # 0.065189774824 69
+ .quad 0x03FB0B046EEE6F7A4
+ .quad 0x03FB0EC139C5DA5FD # 0.066102243114 70
+ .quad 0x03FB0EC139C5DA5FD
+ .quad 0x03FB127EE451413A8 # 0.067015544762 71
+ .quad 0x03FB127EE451413A8
+ .quad 0x03FB163D6EF9579FC # 0.067929681294 72
+ .quad 0x03FB163D6EF9579FC
+ .quad 0x03FB19FCDA271ABC0 # 0.068844654235 73
+ .quad 0x03FB19FCDA271ABC0
+ .quad 0x03FB1DBD2643D1912 # 0.069760465119 74
+ .quad 0x03FB1DBD2643D1912
+ .quad 0x03FB217E53B90D3CE # 0.070677115481 75
+ .quad 0x03FB217E53B90D3CE
+ .quad 0x03FB254062F0A9417 # 0.071594606862 76
+ .quad 0x03FB254062F0A9417
+ .quad 0x03FB29035454CBCB0 # 0.072512940806 77
+ .quad 0x03FB29035454CBCB0
+ .quad 0x03FB2CC7284FE5F1A # 0.073432118863 78
+ .quad 0x03FB2CC7284FE5F1A
+ .quad 0x03FB308BDF4CB4062 # 0.074352142586 79
+ .quad 0x03FB308BDF4CB4062
+ .quad 0x03FB345179B63DD3F # 0.075273013532 80
+ .quad 0x03FB345179B63DD3F
+ .quad 0x03FB3817F7F7D6EAB # 0.076194733263 81
+ .quad 0x03FB3817F7F7D6EAB
+ .quad 0x03FB3BDF5A7D1EE5E # 0.077117303344 82
+ .quad 0x03FB3BDF5A7D1EE5E
+ .quad 0x03FB3F1D405CE86D3 # 0.077908755701 83
+ .quad 0x03FB3F1D405CE86D3
+ .quad 0x03FB42E64BEC266E4 # 0.078832909176 84
+ .quad 0x03FB42E64BEC266E4
+ .quad 0x03FB46B03CF437BC4 # 0.079757917501 85
+ .quad 0x03FB46B03CF437BC4
+ .quad 0x03FB4A7B13E1E3E65 # 0.080683782259 86
+ .quad 0x03FB4A7B13E1E3E65
+ .quad 0x03FB4E46D1223FE84 # 0.081610505036 87
+ .quad 0x03FB4E46D1223FE84
+ .quad 0x03FB52137522AE732 # 0.082538087426 88
+ .quad 0x03FB52137522AE732
+ .quad 0x03FB5555DE434F2A0 # 0.083333843436 89
+ .quad 0x03FB5555DE434F2A0
+ .quad 0x03FB59242FF043D34 # 0.084263026485 90
+ .quad 0x03FB59242FF043D34
+ .quad 0x03FB5CF36997817B2 # 0.085193073719 91
+ .quad 0x03FB5CF36997817B2
+ .quad 0x03FB60C38BA799459 # 0.086123986746 92
+ .quad 0x03FB60C38BA799459
+ .quad 0x03FB6408F471C82A2 # 0.086922602521 93
+ .quad 0x03FB6408F471C82A2
+ .quad 0x03FB67DAC7466CB96 # 0.087855127734 94
+ .quad 0x03FB67DAC7466CB96
+ .quad 0x03FB6BAD83C1883BA # 0.088788523361 95
+ .quad 0x03FB6BAD83C1883BA
+ .quad 0x03FB6EF528C056A2D # 0.089589270768 96
+ .quad 0x03FB6EF528C056A2D
+ .quad 0x03FB72C9985035BB1 # 0.090524287199 97
+ .quad 0x03FB72C9985035BB1
+ .quad 0x03FB769EF2C6B5688 # 0.091460178704 98
+ .quad 0x03FB769EF2C6B5688
+ .quad 0x03FB79E8D70A364C6 # 0.092263069152 99
+ .quad 0x03FB79E8D70A364C6
+ .quad 0x03FB7DBFE6EA733FE # 0.093200590148 100
+ .quad 0x03FB7DBFE6EA733FE
+ .quad 0x03FB8197E2F40E3F0 # 0.094138990914 101
+ .quad 0x03FB8197E2F40E3F0
+ .quad 0x03FB84E40992A4804 # 0.094944035906 102
+ .quad 0x03FB84E40992A4804
+ .quad 0x03FB88BDBD5FC66D2 # 0.095884074919 103
+ .quad 0x03FB88BDBD5FC66D2
+ .quad 0x03FB8C985E9B9EC7E # 0.096824998438 104
+ .quad 0x03FB8C985E9B9EC7E
+ .quad 0x03FB8FE6CAB20E979 # 0.097632209567 105
+ .quad 0x03FB8FE6CAB20E979
+ .quad 0x03FB93C3261014C65 # 0.098574780162 106
+ .quad 0x03FB93C3261014C65
+ .quad 0x03FB97130DC9235DE # 0.099383405543 107
+ .quad 0x03FB97130DC9235DE
+ .quad 0x03FB9AF124D64C623 # 0.100327628989 108
+ .quad 0x03FB9AF124D64C623
+ .quad 0x03FB9E4289871E964 # 0.101137673586 109
+ .quad 0x03FB9E4289871E964
+ .quad 0x03FBA2225DD276FCB # 0.102083555691 110
+ .quad 0x03FBA2225DD276FCB
+ .quad 0x03FBA57540D1FE441 # 0.102895024494 111
+ .quad 0x03FBA57540D1FE441
+ .quad 0x03FBA956D3ECADE60 # 0.103842571097 112
+ .quad 0x03FBA956D3ECADE60
+ .quad 0x03FBACAB3693AB9C0 # 0.104655469123 113
+ .quad 0x03FBACAB3693AB9C0
+ .quad 0x03FBB08E8A10F96F4 # 0.105604686090 114
+ .quad 0x03FBB08E8A10F96F4
+ .quad 0x03FBB3E46DBA02181 # 0.106419018383 115
+ .quad 0x03FBB3E46DBA02181
+ .quad 0x03FBB7C9832F58018 # 0.107369911615 116
+ .quad 0x03FBB7C9832F58018
+ .quad 0x03FBBB20E936D6976 # 0.108185683244 117
+ .quad 0x03FBBB20E936D6976
+ .quad 0x03FBBF07C23BC54EA # 0.109138258671 118
+ .quad 0x03FBBF07C23BC54EA
+ .quad 0x03FBC260ABFFFE972 # 0.109955474734 119
+ .quad 0x03FBC260ABFFFE972
+ .quad 0x03FBC6494A2E418A0 # 0.110909738320 120
+ .quad 0x03FBC6494A2E418A0
+ .quad 0x03FBC9A3B90F57748 # 0.111728403941 121
+ .quad 0x03FBC9A3B90F57748
+ .quad 0x03FBCCFEDBFEE13A8 # 0.112547740324 122
+ .quad 0x03FBCCFEDBFEE13A8
+ .quad 0x03FBD0EA1362CDBFC # 0.113504482008 123
+ .quad 0x03FBD0EA1362CDBFC
+ .quad 0x03FBD446BD753D433 # 0.114325275488 124
+ .quad 0x03FBD446BD753D433
+ .quad 0x03FBD7A41C8627307 # 0.115146743223 125
+ .quad 0x03FBD7A41C8627307
+ .quad 0x03FBDB91F09680DF9 # 0.116105975911 126
+ .quad 0x03FBDB91F09680DF9
+ .quad 0x03FBDEF0D8D466DBB # 0.116928908339 127
+ .quad 0x03FBDEF0D8D466DBB
+ .quad 0x03FBE2507702AF03B # 0.117752518544 128
+ .quad 0x03FBE2507702AF03B
+ .quad 0x03FBE640EB3D2B411 # 0.118714255240 129
+ .quad 0x03FBE640EB3D2B411
+ .quad 0x03FBE9A214A69DD58 # 0.119539337795 130
+ .quad 0x03FBE9A214A69DD58
+ .quad 0x03FBED03F4F440969 # 0.120365101673 131
+ .quad 0x03FBED03F4F440969
+ .quad 0x03FBF0F70CDD992E4 # 0.121329355484 132
+ .quad 0x03FBF0F70CDD992E4
+ .quad 0x03FBF45A7A78B7C3B # 0.122156599431 133
+ .quad 0x03FBF45A7A78B7C3B
+ .quad 0x03FBF7BE9FEDBFDED # 0.122984528276 134
+ .quad 0x03FBF7BE9FEDBFDED
+ .quad 0x03FBFB237D8AB13FB # 0.123813143156 135
+ .quad 0x03FBFB237D8AB13FB
+ .quad 0x03FBFF1A13EAC95FD # 0.124780729104 136
+ .quad 0x03FBFF1A13EAC95FD
+ .quad 0x03FC014040CAB0229 # 0.125610834299 137
+ .quad 0x03FC014040CAB0229
+ .quad 0x03FC02F3D4301417B # 0.126441629140 138
+ .quad 0x03FC02F3D4301417B
+ .quad 0x03FC04A7C44CF87A4 # 0.127273114776 139
+ .quad 0x03FC04A7C44CF87A4
+ .quad 0x03FC06A4D1D26C5E9 # 0.128244055971 140
+ .quad 0x03FC06A4D1D26C5E9
+ .quad 0x03FC08598B59E3A07 # 0.129077042275 141
+ .quad 0x03FC08598B59E3A07
+ .quad 0x03FC0A0EA2164AF02 # 0.129910723024 142
+ .quad 0x03FC0A0EA2164AF02
+ .quad 0x03FC0BC4162F73B66 # 0.130745099376 143
+ .quad 0x03FC0BC4162F73B66
+ .quad 0x03FC0D79E7CD48E58 # 0.131580172493 144
+ .quad 0x03FC0D79E7CD48E58
+ .quad 0x03FC0F301717CF0FB # 0.132415943541 145
+ .quad 0x03FC0F301717CF0FB
+ .quad 0x03FC10E6A437247B7 # 0.133252413686 146
+ .quad 0x03FC10E6A437247B7
+ .quad 0x03FC12E6BFA8FEAD6 # 0.134229180665 147
+ .quad 0x03FC12E6BFA8FEAD6
+ .quad 0x03FC149E189F8642E # 0.135067169541 148
+ .quad 0x03FC149E189F8642E
+ .quad 0x03FC1655CFEA923A4 # 0.135905861231 149
+ .quad 0x03FC1655CFEA923A4
+ .quad 0x03FC180DE5B2ACE5C # 0.136745256915 150
+ .quad 0x03FC180DE5B2ACE5C
+ .quad 0x03FC19C65A207AC07 # 0.137585357777 151
+ .quad 0x03FC19C65A207AC07
+ .quad 0x03FC1B7F2D5CBA842 # 0.138426165001 152
+ .quad 0x03FC1B7F2D5CBA842
+ .quad 0x03FC1D385F90453F2 # 0.139267679777 153
+ .quad 0x03FC1D385F90453F2
+ .quad 0x03FC1EF1F0E40E6CD # 0.140109903297 154
+ .quad 0x03FC1EF1F0E40E6CD
+ .quad 0x03FC20ABE18124098 # 0.140952836755 155
+ .quad 0x03FC20ABE18124098
+ .quad 0x03FC22663190AEACC # 0.141796481350 156
+ .quad 0x03FC22663190AEACC
+ .quad 0x03FC2420E13BF19E3 # 0.142640838281 157
+ .quad 0x03FC2420E13BF19E3
+ .quad 0x03FC25DBF0AC4AED2 # 0.143485908754 158
+ .quad 0x03FC25DBF0AC4AED2
+ .quad 0x03FC2797600B3387B # 0.144331693975 159
+ .quad 0x03FC2797600B3387B
+ .quad 0x03FC29532F823F525 # 0.145178195155 160
+ .quad 0x03FC29532F823F525
+ .quad 0x03FC2B0F5F3B1D3EF # 0.146025413505 161
+ .quad 0x03FC2B0F5F3B1D3EF
+ .quad 0x03FC2CCBEF5F97653 # 0.146873350243 162
+ .quad 0x03FC2CCBEF5F97653
+ .quad 0x03FC2E88E01993187 # 0.147722006588 163
+ .quad 0x03FC2E88E01993187
+ .quad 0x03FC3046319311009 # 0.148571383763 164
+ .quad 0x03FC3046319311009
+ .quad 0x03FC3203E3F62D328 # 0.149421482992 165
+ .quad 0x03FC3203E3F62D328
+ .quad 0x03FC33C1F76D1F469 # 0.150272305505 166
+ .quad 0x03FC33C1F76D1F469
+ .quad 0x03FC35806C223A70F # 0.151123852534 167
+ .quad 0x03FC35806C223A70F
+ .quad 0x03FC373F423FED9A1 # 0.151976125313 168
+ .quad 0x03FC373F423FED9A1
+ .quad 0x03FC38FE79F0C3771 # 0.152829125080 169
+ .quad 0x03FC38FE79F0C3771
+ .quad 0x03FC3ABE135F62A12 # 0.153682853077 170
+ .quad 0x03FC3ABE135F62A12
+ .quad 0x03FC3C335E0447D71 # 0.154394850259 171
+ .quad 0x03FC3C335E0447D71
+ .quad 0x03FC3DF3AB13505F9 # 0.155249916579 172
+ .quad 0x03FC3DF3AB13505F9
+ .quad 0x03FC3FB45A59928CA # 0.156105714663 173
+ .quad 0x03FC3FB45A59928CA
+ .quad 0x03FC41756C0220C81 # 0.156962245765 174
+ .quad 0x03FC41756C0220C81
+ .quad 0x03FC4336E03829D61 # 0.157819511141 175
+ .quad 0x03FC4336E03829D61
+ .quad 0x03FC44F8B726F8EFE # 0.158677512051 176
+ .quad 0x03FC44F8B726F8EFE
+ .quad 0x03FC46BAF0F9F5DB8 # 0.159536249760 177
+ .quad 0x03FC46BAF0F9F5DB8
+ .quad 0x03FC48326CD3EC797 # 0.160252428262 178
+ .quad 0x03FC48326CD3EC797
+ .quad 0x03FC49F55C6502F81 # 0.161112520058 179
+ .quad 0x03FC49F55C6502F81
+ .quad 0x03FC4BB8AF55DE908 # 0.161973352249 180
+ .quad 0x03FC4BB8AF55DE908
+ .quad 0x03FC4D7C65D25566D # 0.162834926111 181
+ .quad 0x03FC4D7C65D25566D
+ .quad 0x03FC4F4080065AA7F # 0.163697242922 182
+ .quad 0x03FC4F4080065AA7F
+ .quad 0x03FC50B98CD30A759 # 0.164416408720 183
+ .quad 0x03FC50B98CD30A759
+ .quad 0x03FC527E5E4A1B58D # 0.165280090939 184
+ .quad 0x03FC527E5E4A1B58D
+ .quad 0x03FC544393F5DF80F # 0.166144519750 185
+ .quad 0x03FC544393F5DF80F
+ .quad 0x03FC56092E02BA514 # 0.167009696444 186
+ .quad 0x03FC56092E02BA514
+ .quad 0x03FC57837B3098F2C # 0.167731249257 187
+ .quad 0x03FC57837B3098F2C
+ .quad 0x03FC5949CDB873419 # 0.168597800437 188
+ .quad 0x03FC5949CDB873419
+ .quad 0x03FC5B10851FC924A # 0.169465103180 189
+ .quad 0x03FC5B10851FC924A
+ .quad 0x03FC5C8BC079D8289 # 0.170188430518 190
+ .quad 0x03FC5C8BC079D8289
+ .quad 0x03FC5E533144C1718 # 0.171057114516 191
+ .quad 0x03FC5E533144C1718
+ .quad 0x03FC601B076E7A8A8 # 0.171926553783 192
+ .quad 0x03FC601B076E7A8A8
+ .quad 0x03FC619732215D786 # 0.172651664394 193
+ .quad 0x03FC619732215D786
+ .quad 0x03FC635FC298F6C77 # 0.173522491735 194
+ .quad 0x03FC635FC298F6C77
+ .quad 0x03FC6528B8EFA5D16 # 0.174394078077 195
+ .quad 0x03FC6528B8EFA5D16
+ .quad 0x03FC66A5D42A3AD33 # 0.175120980777 196
+ .quad 0x03FC66A5D42A3AD33
+ .quad 0x03FC686F85BAD4298 # 0.175993962063 197
+ .quad 0x03FC686F85BAD4298
+ .quad 0x03FC6A399DABBD383 # 0.176867706111 198
+ .quad 0x03FC6A399DABBD383
+ .quad 0x03FC6BB7AA9F22C40 # 0.177596409780 199
+ .quad 0x03FC6BB7AA9F22C40
+ .quad 0x03FC6D827EB7C1E57 # 0.178471555693 200
+ .quad 0x03FC6D827EB7C1E57
+ .quad 0x03FC6F0128B756AB9 # 0.179201429458 201
+ .quad 0x03FC6F0128B756AB9
+ .quad 0x03FC70CCB9927BCF6 # 0.180077981742 202
+ .quad 0x03FC70CCB9927BCF6
+ .quad 0x03FC7298B1A4E32B6 # 0.180955303044 203
+ .quad 0x03FC7298B1A4E32B6
+ .quad 0x03FC74184F58CC7DC # 0.181686992547 204
+ .quad 0x03FC74184F58CC7DC
+ .quad 0x03FC75E5051E74141 # 0.182565727226 205
+ .quad 0x03FC75E5051E74141
+ .quad 0x03FC77654128F6127 # 0.183298596442 206
+ .quad 0x03FC77654128F6127
+ .quad 0x03FC7932B53E97639 # 0.184178749058 207
+ .quad 0x03FC7932B53E97639
+ .quad 0x03FC7AB390229D8FD # 0.184912801796 208
+ .quad 0x03FC7AB390229D8FD
+ .quad 0x03FC7C81C325B4A5E # 0.185794376934 209
+ .quad 0x03FC7C81C325B4A5E
+ .quad 0x03FC7E033D66CD24A # 0.186529617023 210
+ .quad 0x03FC7E033D66CD24A
+ .quad 0x03FC7FD22FF599D4C # 0.187412619288 211
+ .quad 0x03FC7FD22FF599D4C
+ .quad 0x03FC81544A17F67C1 # 0.188149050576 212
+ .quad 0x03FC81544A17F67C1
+ .quad 0x03FC8323FCD17DAC8 # 0.189033484595 213
+ .quad 0x03FC8323FCD17DAC8
+ .quad 0x03FC84A6B759F512D # 0.189771110947 214
+ .quad 0x03FC84A6B759F512D
+ .quad 0x03FC86772ADE0201C # 0.190656981373 215
+ .quad 0x03FC86772ADE0201C
+ .quad 0x03FC87FA865210911 # 0.191395806674 216
+ .quad 0x03FC87FA865210911
+ .quad 0x03FC89CBBB4136201 # 0.192283118179 217
+ .quad 0x03FC89CBBB4136201
+ .quad 0x03FC8B4FB826FF291 # 0.193023146334 218
+ .quad 0x03FC8B4FB826FF291
+ .quad 0x03FC8D21AF2299298 # 0.193911903613 219
+ .quad 0x03FC8D21AF2299298
+ .quad 0x03FC8EA64E00E7FC0 # 0.194653138545 220
+ .quad 0x03FC8EA64E00E7FC0
+ .quad 0x03FC902B36AB7681D # 0.195394923313 221
+ .quad 0x03FC902B36AB7681D
+ .quad 0x03FC91FE49096581E # 0.196285791969 222
+ .quad 0x03FC91FE49096581E
+ .quad 0x03FC9383D471B869B # 0.197028789254 223
+ .quad 0x03FC9383D471B869B
+ .quad 0x03FC9557AA6B87F65 # 0.197921115309 224
+ .quad 0x03FC9557AA6B87F65
+ .quad 0x03FC96DDD91A0B959 # 0.198665329082 225
+ .quad 0x03FC96DDD91A0B959
+ .quad 0x03FC9864522D04491 # 0.199410097121 226
+ .quad 0x03FC9864522D04491
+ .quad 0x03FC9A3945D1A44B3 # 0.200304551564 227
+ .quad 0x03FC9A3945D1A44B3
+ .quad 0x03FC9BC062F26FC3B # 0.201050541900 228
+ .quad 0x03FC9BC062F26FC3B
+ .quad 0x03FC9D47CAD2C1871 # 0.201797089154 229
+ .quad 0x03FC9D47CAD2C1871
+ .quad 0x03FC9F1DDD7FE4F8B # 0.202693682161 230
+ .quad 0x03FC9F1DDD7FE4F8B
+ .quad 0x03FCA0A5EA371A910 # 0.203441457564 231
+ .quad 0x03FCA0A5EA371A910
+ .quad 0x03FCA22E42098F498 # 0.204189792554 232
+ .quad 0x03FCA22E42098F498
+ .quad 0x03FCA405751F6CCE4 # 0.205088534376 233
+ .quad 0x03FCA405751F6CCE4
+ .quad 0x03FCA58E729348F40 # 0.205838103409 234
+ .quad 0x03FCA58E729348F40
+ .quad 0x03FCA717BB7EC64A3 # 0.206588234717 235
+ .quad 0x03FCA717BB7EC64A3
+ .quad 0x03FCA8F010601E5FD # 0.207489135679 236
+ .quad 0x03FCA8F010601E5FD
+ .quad 0x03FCAA79FFB8FCD48 # 0.208240506966 237
+ .quad 0x03FCAA79FFB8FCD48
+ .quad 0x03FCAC043AE68965A # 0.208992443238 238
+ .quad 0x03FCAC043AE68965A
+ .quad 0x03FCAD8EC205FB6AD # 0.209744945343 239
+ .quad 0x03FCAD8EC205FB6AD
+ .quad 0x03FCAF6895610DBAD # 0.210648695969 240
+ .quad 0x03FCAF6895610DBAD
+ .quad 0x03FCB0F3C3FBD65C9 # 0.211402445910 241
+ .quad 0x03FCB0F3C3FBD65C9
+ .quad 0x03FCB27F3EE674219 # 0.212156764419 242
+ .quad 0x03FCB27F3EE674219
+ .quad 0x03FCB40B063E65B0F # 0.212911652354 243
+ .quad 0x03FCB40B063E65B0F
+ .quad 0x03FCB5E65A8096C88 # 0.213818270730 244
+ .quad 0x03FCB5E65A8096C88
+ .quad 0x03FCB772CA646760C # 0.214574414434 245
+ .quad 0x03FCB772CA646760C
+ .quad 0x03FCB8FF871461198 # 0.215331130323 246
+ .quad 0x03FCB8FF871461198
+ .quad 0x03FCBA8C90AE4AD19 # 0.216088419265 247
+ .quad 0x03FCBA8C90AE4AD19
+ .quad 0x03FCBC19E74FFCBDA # 0.216846282128 248
+ .quad 0x03FCBC19E74FFCBDA
+ .quad 0x03FCBDF71B83DAE7A # 0.217756476365 249
+ .quad 0x03FCBDF71B83DAE7A
+ .quad 0x03FCBF851C067555C # 0.218515604922 250
+ .quad 0x03FCBF851C067555C
+ .quad 0x03FCC11369F0CDB3C # 0.219275310193 251
+ .quad 0x03FCC11369F0CDB3C
+ .quad 0x03FCC2A205610593E # 0.220035593055 252
+ .quad 0x03FCC2A205610593E
+ .quad 0x03FCC430EE755023B # 0.220796454387 253
+ .quad 0x03FCC430EE755023B
+ .quad 0x03FCC5C0254BF23A8 # 0.221557895069 254
+ .quad 0x03FCC5C0254BF23A8
+ .quad 0x03FCC79F9AB632BF1 # 0.222472389875 255
+ .quad 0x03FCC79F9AB632BF1
+ .quad 0x03FCC92F7D09ABE20 # 0.223235108240 256
+ .quad 0x03FCC92F7D09ABE20
+ .quad 0x03FCCABFAD80D023D # 0.223998408788 257
+ .quad 0x03FCCABFAD80D023D
+ .quad 0x03FCCC502C3A2F1E8 # 0.224762292410 258
+ .quad 0x03FCCC502C3A2F1E8
+ .quad 0x03FCCDE0F9546A5E7 # 0.225526759995 259
+ .quad 0x03FCCDE0F9546A5E7
+ .quad 0x03FCCF7214EE356E9 # 0.226291812439 260
+ .quad 0x03FCCF7214EE356E9
+ .quad 0x03FCD1037F2655E7B # 0.227057450635 261
+ .quad 0x03FCD1037F2655E7B
+ .quad 0x03FCD295381BA37E9 # 0.227823675483 262
+ .quad 0x03FCD295381BA37E9
+ .quad 0x03FCD4273FED08111 # 0.228590487882 263
+ .quad 0x03FCD4273FED08111
+ .quad 0x03FCD5B996B97FB5F # 0.229357888733 264
+ .quad 0x03FCD5B996B97FB5F
+ .quad 0x03FCD74C3CA018C9C # 0.230125878940 265
+ .quad 0x03FCD74C3CA018C9C
+ .quad 0x03FCD8DF31BFF3FF2 # 0.230894459410 266
+ .quad 0x03FCD8DF31BFF3FF2
+ .quad 0x03FCDA727638446A1 # 0.231663631050 267
+ .quad 0x03FCDA727638446A1
+ .quad 0x03FCDC56CAE452F5B # 0.232587418645 268
+ .quad 0x03FCDC56CAE452F5B
+ .quad 0x03FCDDEABE5A3926E # 0.233357894066 269
+ .quad 0x03FCDDEABE5A3926E
+ .quad 0x03FCDF7F018CE771F # 0.234128963578 270
+ .quad 0x03FCDF7F018CE771F
+ .quad 0x03FCE113949BDEC62 # 0.234900628096 271
+ .quad 0x03FCE113949BDEC62
+ .quad 0x03FCE2A877A6B2C0F # 0.235672888541 272
+ .quad 0x03FCE2A877A6B2C0F
+ .quad 0x03FCE43DAACD09BEC # 0.236445745833 273
+ .quad 0x03FCE43DAACD09BEC
+ .quad 0x03FCE5D32E2E9CE87 # 0.237219200895 274
+ .quad 0x03FCE5D32E2E9CE87
+ .quad 0x03FCE76901EB38427 # 0.237993254653 275
+ .quad 0x03FCE76901EB38427
+ .quad 0x03FCE8ADE53F76866 # 0.238612929343 276
+ .quad 0x03FCE8ADE53F76866
+ .quad 0x03FCEA4449F04AAF4 # 0.239388063093 277
+ .quad 0x03FCEA4449F04AAF4
+ .quad 0x03FCEBDAFF5593E99 # 0.240163798141 278
+ .quad 0x03FCEBDAFF5593E99
+ .quad 0x03FCED72058F666C5 # 0.240940135421 279
+ .quad 0x03FCED72058F666C5
+ .quad 0x03FCEF095CBDE9937 # 0.241717075868 280
+ .quad 0x03FCEF095CBDE9937
+ .quad 0x03FCF0A1050157ED6 # 0.242494620422 281
+ .quad 0x03FCF0A1050157ED6
+ .quad 0x03FCF238FE79FF4BF # 0.243272770021 282
+ .quad 0x03FCF238FE79FF4BF
+ .quad 0x03FCF3D1494840D2F # 0.244051525609 283
+ .quad 0x03FCF3D1494840D2F
+ .quad 0x03FCF569E58C91077 # 0.244830888130 284
+ .quad 0x03FCF569E58C91077
+ .quad 0x03FCF702D36777DF0 # 0.245610858531 285
+ .quad 0x03FCF702D36777DF0
+ .quad 0x03FCF89C12F990D0C # 0.246391437760 286
+ .quad 0x03FCF89C12F990D0C
+ .quad 0x03FCFA35A4638AE2C # 0.247172626770 287
+ .quad 0x03FCFA35A4638AE2C
+ .quad 0x03FCFB7D86EEE3B92 # 0.247798017660 288
+ .quad 0x03FCFB7D86EEE3B92
+ .quad 0x03FCFD17ABFCDB683 # 0.248580306677 289
+ .quad 0x03FCFD17ABFCDB683
+ .quad 0x03FCFEB2233EA07CB # 0.249363208150 290
+ .quad 0x03FCFEB2233EA07CB
+ .quad 0x03FD0026766A9671C # 0.250146723037 291
+ .quad 0x03FD0026766A9671C
+ .quad 0x03FD00F40470C7323 # 0.250930852302 292
+ .quad 0x03FD00F40470C7323
+ .quad 0x03FD01C1BBC2735A3 # 0.251715596908 293
+ .quad 0x03FD01C1BBC2735A3
+ .quad 0x03FD028F9C7035C1D # 0.252500957822 294
+ .quad 0x03FD028F9C7035C1D
+ .quad 0x03FD03346E0106062 # 0.253129690945 295
+ .quad 0x03FD03346E0106062
+ .quad 0x03FD0402994B4F041 # 0.253916163656 296
+ .quad 0x03FD0402994B4F041
+ .quad 0x03FD04D0EE20620AF # 0.254703255393 297
+ .quad 0x03FD04D0EE20620AF
+ .quad 0x03FD059F6C910034D # 0.255490967131 298
+ .quad 0x03FD059F6C910034D
+ .quad 0x03FD066E14ADF4BFD # 0.256279299848 299
+ .quad 0x03FD066E14ADF4BFD
+ .quad 0x03FD07138604D5864 # 0.256910413785 300
+ .quad 0x03FD07138604D5864
+ .quad 0x03FD07E2794F3E8C1 # 0.257699866735 301
+ .quad 0x03FD07E2794F3E8C1
+ .quad 0x03FD08B196753A125 # 0.258489943414 302
+ .quad 0x03FD08B196753A125
+ .quad 0x03FD0980DD87BA2DD # 0.259280644807 303
+ .quad 0x03FD0980DD87BA2DD
+ .quad 0x03FD0A504E97BB40C # 0.260071971904 304
+ .quad 0x03FD0A504E97BB40C
+ .quad 0x03FD0AF660EB9E278 # 0.260705484754 305
+ .quad 0x03FD0AF660EB9E278
+ .quad 0x03FD0BC61DBBA97CB # 0.261497940616 306
+ .quad 0x03FD0BC61DBBA97CB
+ .quad 0x03FD0C9604B8FC51E # 0.262291024962 307
+ .quad 0x03FD0C9604B8FC51E
+ .quad 0x03FD0D3C7586CD5E5 # 0.262925945618 308
+ .quad 0x03FD0D3C7586CD5E5
+ .quad 0x03FD0E0CA89A72D29 # 0.263720163752 309
+ .quad 0x03FD0E0CA89A72D29
+ .quad 0x03FD0EDD060B78082 # 0.264515013170 310
+ .quad 0x03FD0EDD060B78082
+ .quad 0x03FD0FAD8DEB1E2C0 # 0.265310494876 311
+ .quad 0x03FD0FAD8DEB1E2C0
+ .quad 0x03FD10547F9D26ABC # 0.265947336165 312
+ .quad 0x03FD10547F9D26ABC
+ .quad 0x03FD1125540925114 # 0.266743958529 313
+ .quad 0x03FD1125540925114
+ .quad 0x03FD11F653144CB8B # 0.267541216005 314
+ .quad 0x03FD11F653144CB8B
+ .quad 0x03FD129DA43F5BE9E # 0.268179479949 315
+ .quad 0x03FD129DA43F5BE9E
+ .quad 0x03FD136EF02E8290C # 0.268977883185 316
+ .quad 0x03FD136EF02E8290C
+ .quad 0x03FD144066EDAE406 # 0.269776924378 317
+ .quad 0x03FD144066EDAE406
+ .quad 0x03FD14E817FF359D7 # 0.270416617347 318
+ .quad 0x03FD14E817FF359D7
+ .quad 0x03FD15B9DBFA9DEC8 # 0.271216809436 319
+ .quad 0x03FD15B9DBFA9DEC8
+ .quad 0x03FD168BCAF73B3EB # 0.272017642345 320
+ .quad 0x03FD168BCAF73B3EB
+ .quad 0x03FD1733DC5D68DE8 # 0.272658770753 321
+ .quad 0x03FD1733DC5D68DE8
+ .quad 0x03FD180618EF18ADE # 0.273460759729 322
+ .quad 0x03FD180618EF18ADE
+ .quad 0x03FD18D880B3826FE # 0.274263392407 323
+ .quad 0x03FD18D880B3826FE
+ .quad 0x03FD1980F2DD42B6F # 0.274905962710 324
+ .quad 0x03FD1980F2DD42B6F
+ .quad 0x03FD1A53A8902E70B # 0.275709756661 325
+ .quad 0x03FD1A53A8902E70B
+ .quad 0x03FD1AFC59297024D # 0.276353257326 326
+ .quad 0x03FD1AFC59297024D
+ .quad 0x03FD1BCF5D04AE1EA # 0.277158215914 327
+ .quad 0x03FD1BCF5D04AE1EA
+ .quad 0x03FD1CA28C64BAE54 # 0.277963822983 328
+ .quad 0x03FD1CA28C64BAE54
+ .quad 0x03FD1D4B9E796C245 # 0.278608776246 329
+ .quad 0x03FD1D4B9E796C245
+ .quad 0x03FD1E1F1C5C3A06C # 0.279415553216 330
+ .quad 0x03FD1E1F1C5C3A06C
+ .quad 0x03FD1EC86D5747AAD # 0.280061443760 331
+ .quad 0x03FD1EC86D5747AAD
+ .quad 0x03FD1F9C39F74C559 # 0.280869394034 332
+ .quad 0x03FD1F9C39F74C559
+ .quad 0x03FD2070326F1F789 # 0.281677997620 333
+ .quad 0x03FD2070326F1F789
+ .quad 0x03FD2119E59F8789C # 0.282325351583 334
+ .quad 0x03FD2119E59F8789C
+ .quad 0x03FD21EE2D300381C # 0.283135133796 335
+ .quad 0x03FD21EE2D300381C
+ .quad 0x03FD22981FBEF797A # 0.283783432036 336
+ .quad 0x03FD22981FBEF797A
+ .quad 0x03FD236CB6A339EED # 0.284594396317 337
+ .quad 0x03FD236CB6A339EED
+ .quad 0x03FD2416E8C01F606 # 0.285243641592 338
+ .quad 0x03FD2416E8C01F606
+ .quad 0x03FD24EBCF3387FF6 # 0.286055791397 339
+ .quad 0x03FD24EBCF3387FF6
+ .quad 0x03FD2596410DF963A # 0.286705986479 340
+ .quad 0x03FD2596410DF963A
+ .quad 0x03FD266B774C2AF55 # 0.287519325279 341
+ .quad 0x03FD266B774C2AF55
+ .quad 0x03FD27162913F873F # 0.288170472950 342
+ .quad 0x03FD27162913F873F
+ .quad 0x03FD27EBAF58D8C9C # 0.288985004232 343
+ .quad 0x03FD27EBAF58D8C9C
+ .quad 0x03FD2896A13E086A3 # 0.289637107288 344
+ .quad 0x03FD2896A13E086A3
+ .quad 0x03FD296C77C5C0E13 # 0.290452834554 345
+ .quad 0x03FD296C77C5C0E13
+ .quad 0x03FD2A17A9F88EDD2 # 0.291105895801 346
+ .quad 0x03FD2A17A9F88EDD2
+ .quad 0x03FD2AEDD0FF8CC2C # 0.291922822568 347
+ .quad 0x03FD2AEDD0FF8CC2C
+ .quad 0x03FD2B9943B06BD77 # 0.292576844829 348
+ .quad 0x03FD2B9943B06BD77
+ .quad 0x03FD2C6FBB7360D0E # 0.293394974630 349
+ .quad 0x03FD2C6FBB7360D0E
+ .quad 0x03FD2D1B6ED2FA90C # 0.294049960734 350
+ .quad 0x03FD2D1B6ED2FA90C
+ .quad 0x03FD2DC73F01B0DD4 # 0.294705376127 351
+ .quad 0x03FD2DC73F01B0DD4
+ .quad 0x03FD2E9E2BCE12286 # 0.295525249913 352
+ .quad 0x03FD2E9E2BCE12286
+ .quad 0x03FD2F4A3CF22EDC2 # 0.296181633264 353
+ .quad 0x03FD2F4A3CF22EDC2
+ .quad 0x03FD30217B1006601 # 0.297002718785 354
+ .quad 0x03FD30217B1006601
+ .quad 0x03FD30CDCD5ABA762 # 0.297660072959 355
+ .quad 0x03FD30CDCD5ABA762
+ .quad 0x03FD31A55D07A8590 # 0.298482373803 356
+ .quad 0x03FD31A55D07A8590
+ .quad 0x03FD3251F0AA5CC1A # 0.299140701674 357
+ .quad 0x03FD3251F0AA5CC1A
+ .quad 0x03FD32FEA167A6D70 # 0.299799463226 358
+ .quad 0x03FD32FEA167A6D70
+ .quad 0x03FD33D6A7509D491 # 0.300623525901 359
+ .quad 0x03FD33D6A7509D491
+ .quad 0x03FD348399ADA9D94 # 0.301283265328 360
+ .quad 0x03FD348399ADA9D94
+ .quad 0x03FD3530A9454ADC9 # 0.301943440298 361
+ .quad 0x03FD3530A9454ADC9
+ .quad 0x03FD360925EC44F5C # 0.302769272371 362
+ .quad 0x03FD360925EC44F5C
+ .quad 0x03FD36B6776BE1116 # 0.303430429420 363
+ .quad 0x03FD36B6776BE1116
+ .quad 0x03FD378F469437FB4 # 0.304257490918 364
+ .quad 0x03FD378F469437FB4
+ .quad 0x03FD383CDA2E14ECB # 0.304919632971 365
+ .quad 0x03FD383CDA2E14ECB
+ .quad 0x03FD38EA8B3924521 # 0.305582213748 366
+ .quad 0x03FD38EA8B3924521
+ .quad 0x03FD39C3D1FD60E74 # 0.306411057558 367
+ .quad 0x03FD39C3D1FD60E74
+ .quad 0x03FD3A71C56BB48C7 # 0.307074627589 368
+ .quad 0x03FD3A71C56BB48C7
+ .quad 0x03FD3B1FD66BC8D10 # 0.307738638238 369
+ .quad 0x03FD3B1FD66BC8D10
+ .quad 0x03FD3BF995502CB5C # 0.308569272059 370
+ .quad 0x03FD3BF995502CB5C
+ .quad 0x03FD3CA7E8FD01DF6 # 0.309234276240 371
+ .quad 0x03FD3CA7E8FD01DF6
+ .quad 0x03FD3D565A5C5BF11 # 0.309899722945 372
+ .quad 0x03FD3D565A5C5BF11
+ .quad 0x03FD3E3091E6049FB # 0.310732154526 373
+ .quad 0x03FD3E3091E6049FB
+ .quad 0x03FD3EDF463C1683E # 0.311398599069 374
+ .quad 0x03FD3EDF463C1683E
+ .quad 0x03FD3F8E1865A82DD # 0.312065488057 375
+ .quad 0x03FD3F8E1865A82DD
+ .quad 0x03FD403D086CEA79B # 0.312732822082 376
+ .quad 0x03FD403D086CEA79B
+ .quad 0x03FD4117DE854CA15 # 0.313567616354 377
+ .quad 0x03FD4117DE854CA15
+ .quad 0x03FD41C711E4BA15E # 0.314235953889 378
+ .quad 0x03FD41C711E4BA15E
+ .quad 0x03FD427663431B221 # 0.314904738398 379
+ .quad 0x03FD427663431B221
+ .quad 0x03FD4325D2AAB6F18 # 0.315573970480 380
+ .quad 0x03FD4325D2AAB6F18
+ .quad 0x03FD44014838E5513 # 0.316411140893 381
+ .quad 0x03FD44014838E5513
+ .quad 0x03FD44B0FB5AF4F44 # 0.317081382205 382
+ .quad 0x03FD44B0FB5AF4F44
+ .quad 0x03FD4560CCA7CB3B2 # 0.317752073041 383
+ .quad 0x03FD4560CCA7CB3B2
+ .quad 0x03FD4610BC29C5E18 # 0.318423214006 384
+ .quad 0x03FD4610BC29C5E18
+ .quad 0x03FD46ECD216CDCB5 # 0.319262774126 385
+ .quad 0x03FD46ECD216CDCB5
+ .quad 0x03FD479D05B65CB60 # 0.319934930091 386
+ .quad 0x03FD479D05B65CB60
+ .quad 0x03FD484D57ACE5A1A # 0.320607538154 387
+ .quad 0x03FD484D57ACE5A1A
+ .quad 0x03FD48FDC804DD1CB # 0.321280598924 388
+ .quad 0x03FD48FDC804DD1CB
+ .quad 0x03FD49DA7F3BCC420 # 0.322122562432 389
+ .quad 0x03FD49DA7F3BCC420
+ .quad 0x03FD4A8B341552B09 # 0.322796644021 390
+ .quad 0x03FD4A8B341552B09
+ .quad 0x03FD4B3C077267E9A # 0.323471180303 391
+ .quad 0x03FD4B3C077267E9A
+ .quad 0x03FD4BECF95D97914 # 0.324146171892 392
+ .quad 0x03FD4BECF95D97914
+ .quad 0x03FD4C9E09E172C3D # 0.324821619401 393
+ .quad 0x03FD4C9E09E172C3D
+ .quad 0x03FD4D4F3908901A0 # 0.325497523449 394
+ .quad 0x03FD4D4F3908901A0
+ .quad 0x03FD4E2CDF1F341C1 # 0.326343046455 395
+ .quad 0x03FD4E2CDF1F341C1
+ .quad 0x03FD4EDE535C79642 # 0.327019979972 396
+ .quad 0x03FD4EDE535C79642
+ .quad 0x03FD4F8FE65F90500 # 0.327697372039 397
+ .quad 0x03FD4F8FE65F90500
+ .quad 0x03FD5041983326F2D # 0.328375223276 398
+ .quad 0x03FD5041983326F2D
+ .quad 0x03FD50F368E1F0F02 # 0.329053534308 399
+ .quad 0x03FD50F368E1F0F02
+ .quad 0x03FD51A55876A77F5 # 0.329732305758 400
+ .quad 0x03FD51A55876A77F5
+ .quad 0x03FD5283EF743F98B # 0.330581418486 401
+ .quad 0x03FD5283EF743F98B
+ .quad 0x03FD533624B59CA35 # 0.331261228165 402
+ .quad 0x03FD533624B59CA35
+ .quad 0x03FD53E878FFE6EAE # 0.331941500300 403
+ .quad 0x03FD53E878FFE6EAE
+ .quad 0x03FD549AEC5DEF880 # 0.332622235521 404
+ .quad 0x03FD549AEC5DEF880
+ .quad 0x03FD554D7EDA8D3C4 # 0.333303434457 405
+ .quad 0x03FD554D7EDA8D3C4
+ .quad 0x03FD560030809C759 # 0.333985097742 406
+ .quad 0x03FD560030809C759
+ .quad 0x03FD56B3015AFF52C # 0.334667226008 407
+ .quad 0x03FD56B3015AFF52C
+ .quad 0x03FD5765F1749DA6C # 0.335349819892 408
+ .quad 0x03FD5765F1749DA6C
+ .quad 0x03FD581900D864FD7 # 0.336032880027 409
+ .quad 0x03FD581900D864FD7
+ .quad 0x03FD58CC2F91489F5 # 0.336716407053 410
+ .quad 0x03FD58CC2F91489F5
+ .quad 0x03FD59AC5618CCE38 # 0.337571473373 411
+ .quad 0x03FD59AC5618CCE38
+ .quad 0x03FD5A5FCB795780C # 0.338256053239 412
+ .quad 0x03FD5A5FCB795780C
+ .quad 0x03FD5B136052BCE39 # 0.338941102075 413
+ .quad 0x03FD5B136052BCE39
+ .quad 0x03FD5BC714B008E23 # 0.339626620526 414
+ .quad 0x03FD5BC714B008E23
+ .quad 0x03FD5C7AE89C4D254 # 0.340312609234 415
+ .quad 0x03FD5C7AE89C4D254
+ .quad 0x03FD5D2EDC22A12BA # 0.340999068845 416
+ .quad 0x03FD5D2EDC22A12BA
+ .quad 0x03FD5DE2EF4E224D6 # 0.341686000008 417
+ .quad 0x03FD5DE2EF4E224D6
+ .quad 0x03FD5E972229F3C15 # 0.342373403369 418
+ .quad 0x03FD5E972229F3C15
+ .quad 0x03FD5F4B74C13EA04 # 0.343061279578 419
+ .quad 0x03FD5F4B74C13EA04
+ .quad 0x03FD5FFFE71F31E9A # 0.343749629287 420
+ .quad 0x03FD5FFFE71F31E9A
+ .quad 0x03FD60B4794F02875 # 0.344438453147 421
+ .quad 0x03FD60B4794F02875
+ .quad 0x03FD61692B5BEB520 # 0.345127751813 422
+ .quad 0x03FD61692B5BEB520
+ .quad 0x03FD621DFD512D14F # 0.345817525940 423
+ .quad 0x03FD621DFD512D14F
+ .quad 0x03FD62D2EF3A0E933 # 0.346507776183 424
+ .quad 0x03FD62D2EF3A0E933
+ .quad 0x03FD63880121DC8AB # 0.347198503200 425
+ .quad 0x03FD63880121DC8AB
+ .quad 0x03FD643D3313E9B92 # 0.347889707652 426
+ .quad 0x03FD643D3313E9B92
+ .quad 0x03FD64F2851B8EE01 # 0.348581390197 427
+ .quad 0x03FD64F2851B8EE01
+ .quad 0x03FD65A7F7442AC90 # 0.349273551498 428
+ .quad 0x03FD65A7F7442AC90
+ .quad 0x03FD665D8999224A5 # 0.349966192218 429
+ .quad 0x03FD665D8999224A5
+ .quad 0x03FD67133C25E04A5 # 0.350659313022 430
+ .quad 0x03FD67133C25E04A5
+ .quad 0x03FD67C90EF5D5C4C # 0.351352914576 431
+ .quad 0x03FD67C90EF5D5C4C
+ .quad 0x03FD687F021479CEE # 0.352046997547 432
+ .quad 0x03FD687F021479CEE
+ .quad 0x03FD6935158D499B3 # 0.352741562603 433
+ .quad 0x03FD6935158D499B3
+ .quad 0x03FD69EB496BC87E5 # 0.353436610416 434
+ .quad 0x03FD69EB496BC87E5
+ .quad 0x03FD6AA19DBB7FF34 # 0.354132141656 435
+ .quad 0x03FD6AA19DBB7FF34
+ .quad 0x03FD6B581287FF9FD # 0.354828156996 436
+ .quad 0x03FD6B581287FF9FD
+ .quad 0x03FD6C0EA7DCDD591 # 0.355524657112 437
+ .quad 0x03FD6C0EA7DCDD591
+ .quad 0x03FD6C97AD3CFCFD9 # 0.356047350738 438
+ .quad 0x03FD6C97AD3CFCFD9
+ .quad 0x03FD6D4E7B9C727EC # 0.356744700836 439
+ .quad 0x03FD6D4E7B9C727EC
+ .quad 0x03FD6E056AA4421D6 # 0.357442537571 440
+ .quad 0x03FD6E056AA4421D6
+ .quad 0x03FD6EBC7A6019066 # 0.358140861621 441
+ .quad 0x03FD6EBC7A6019066
+ .quad 0x03FD6F73AADBAAAB7 # 0.358839673669 442
+ .quad 0x03FD6F73AADBAAAB7
+ .quad 0x03FD702AFC22B0C6D # 0.359538974397 443
+ .quad 0x03FD702AFC22B0C6D
+ .quad 0x03FD70E26E40EB5FA # 0.360238764489 444
+ .quad 0x03FD70E26E40EB5FA
+ .quad 0x03FD719A014220CF5 # 0.360939044629 445
+ .quad 0x03FD719A014220CF5
+ .quad 0x03FD7251B5321DC54 # 0.361639815506 446
+ .quad 0x03FD7251B5321DC54
+ .quad 0x03FD73098A1CB54BA # 0.362341077807 447
+ .quad 0x03FD73098A1CB54BA
+ .quad 0x03FD73937F783CEBA # 0.362867347444 448
+ .quad 0x03FD73937F783CEBA
+ .quad 0x03FD744B8E35E9EDA # 0.363569471398 449
+ .quad 0x03FD744B8E35E9EDA
+ .quad 0x03FD7503BE0ED6C66 # 0.364272088676 450
+ .quad 0x03FD7503BE0ED6C66
+ .quad 0x03FD75BC0F0EEE7DE # 0.364975199972 451
+ .quad 0x03FD75BC0F0EEE7DE
+ .quad 0x03FD76748142228C7 # 0.365678805982 452
+ .quad 0x03FD76748142228C7
+ .quad 0x03FD772D14B46AE00 # 0.366382907402 453
+ .quad 0x03FD772D14B46AE00
+ .quad 0x03FD77E5C971C5E06 # 0.367087504930 454
+ .quad 0x03FD77E5C971C5E06
+ .quad 0x03FD787066E04915F # 0.367616279067 455
+ .quad 0x03FD787066E04915F
+ .quad 0x03FD792955FDF47A3 # 0.368321746469 456
+ .quad 0x03FD792955FDF47A3
+ .quad 0x03FD79E26687CFB3D # 0.369027711906 457
+ .quad 0x03FD79E26687CFB3D
+ .quad 0x03FD7A9B9889F19E2 # 0.369734176082 458
+ .quad 0x03FD7A9B9889F19E2
+ .quad 0x03FD7B54EC1077A48 # 0.370441139703 459
+ .quad 0x03FD7B54EC1077A48
+ .quad 0x03FD7C0E612785C74 # 0.371148603475 460
+ .quad 0x03FD7C0E612785C74
+ .quad 0x03FD7C998F06FB152 # 0.371679529954 461
+ .quad 0x03FD7C998F06FB152
+ .quad 0x03FD7D533EF841E8A # 0.372387870696 462
+ .quad 0x03FD7D533EF841E8A
+ .quad 0x03FD7E0D109B95F19 # 0.373096713539 463
+ .quad 0x03FD7E0D109B95F19
+ .quad 0x03FD7EC703FD340AA # 0.373806059198 464
+ .quad 0x03FD7EC703FD340AA
+ .quad 0x03FD7F8119295FB9B # 0.374515908385 465
+ .quad 0x03FD7F8119295FB9B
+ .quad 0x03FD800CBF3ED1CC2 # 0.375048626146 466
+ .quad 0x03FD800CBF3ED1CC2
+ .quad 0x03FD80C70FAB0BDF6 # 0.375759358229 467
+ .quad 0x03FD80C70FAB0BDF6
+ .quad 0x03FD81818203AFC7F # 0.376470595813 468
+ .quad 0x03FD81818203AFC7F
+ .quad 0x03FD823C16551A3C3 # 0.377182339615 469
+ .quad 0x03FD823C16551A3C3
+ .quad 0x03FD82C81BE4DFF4A # 0.377716480107 470
+ .quad 0x03FD82C81BE4DFF4A
+ .quad 0x03FD8382EBC7794D1 # 0.378429111528 471
+ .quad 0x03FD8382EBC7794D1
+ .quad 0x03FD843DDDC4FB137 # 0.379142251156 472
+ .quad 0x03FD843DDDC4FB137
+ .quad 0x03FD84F8F1E9DB72B # 0.379855899714 473
+ .quad 0x03FD84F8F1E9DB72B
+ .quad 0x03FD85855776DCBFB # 0.380391470556 474
+ .quad 0x03FD85855776DCBFB
+ .quad 0x03FD8640A77EB3957 # 0.381106011494 475
+ .quad 0x03FD8640A77EB3957
+ .quad 0x03FD86FC19D05148E # 0.381821063366 476
+ .quad 0x03FD86FC19D05148E
+ .quad 0x03FD87B7AE7845C0F # 0.382536626902 477
+ .quad 0x03FD87B7AE7845C0F
+ .quad 0x03FD8844748678822 # 0.383073635776 478
+ .quad 0x03FD8844748678822
+ .quad 0x03FD89004563D3DFD # 0.383790096491 479
+ .quad 0x03FD89004563D3DFD
+ .quad 0x03FD89BC38BA356B4 # 0.384507070890 480
+ .quad 0x03FD89BC38BA356B4
+ .quad 0x03FD8A4945E20894E # 0.385045139237 481
+ .quad 0x03FD8A4945E20894E
+ .quad 0x03FD8B0575AAB1FC5 # 0.385763014358 482
+ .quad 0x03FD8B0575AAB1FC5
+ .quad 0x03FD8BC1C80F45A32 # 0.386481405193 483
+ .quad 0x03FD8BC1C80F45A32
+ .quad 0x03FD8C7E3D1C80B2F # 0.387200312485 484
+ .quad 0x03FD8C7E3D1C80B2F
+ .quad 0x03FD8D0BABACC89EE # 0.387739832326 485
+ .quad 0x03FD8D0BABACC89EE
+ .quad 0x03FD8DC85D7FE5013 # 0.388459645206 486
+ .quad 0x03FD8DC85D7FE5013
+ .quad 0x03FD8E85321ED5598 # 0.389179976589 487
+ .quad 0x03FD8E85321ED5598
+ .quad 0x03FD8F12E873862C7 # 0.389720565845 488
+ .quad 0x03FD8F12E873862C7
+ .quad 0x03FD8FCFFA1614AA0 # 0.390441806410 489
+ .quad 0x03FD8FCFFA1614AA0
+ .quad 0x03FD908D2EA7D9511 # 0.391163567538 490
+ .quad 0x03FD908D2EA7D9511
+ .quad 0x03FD911B2D09ED9D6 # 0.391705230456 491
+ .quad 0x03FD911B2D09ED9D6
+ .quad 0x03FD91D89EDD6B7FF # 0.392427904381 492
+ .quad 0x03FD91D89EDD6B7FF
+ .quad 0x03FD929633C3B7D3E # 0.393151100941 493
+ .quad 0x03FD929633C3B7D3E
+ .quad 0x03FD93247A7C99B52 # 0.393693841796 494
+ .quad 0x03FD93247A7C99B52
+ .quad 0x03FD93E24CE3195E8 # 0.394417954789 495
+ .quad 0x03FD93E24CE3195E8
+ .quad 0x03FD9470C1CB1962E # 0.394961383840 496
+ .quad 0x03FD9470C1CB1962E
+ .quad 0x03FD952ED1D9C0435 # 0.395686415592 497
+ .quad 0x03FD952ED1D9C0435
+ .quad 0x03FD95ED0535EA5D9 # 0.396411973396 498
+ .quad 0x03FD95ED0535EA5D9
+ .quad 0x03FD967BC2EDCCE17 # 0.396956487431 499
+ .quad 0x03FD967BC2EDCCE17
+ .quad 0x03FD973A3431356AE # 0.397682967666 500
+ .quad 0x03FD973A3431356AE
+ .quad 0x03FD97F8C8E64A1C7 # 0.398409976059 501
+ .quad 0x03FD97F8C8E64A1C7
+ .quad 0x03FD9887CFB8A3932 # 0.398955579419 502
+ .quad 0x03FD9887CFB8A3932
+ .quad 0x03FD9946A2946EF3C # 0.399683513937 503
+ .quad 0x03FD9946A2946EF3C
+ .quad 0x03FD99D5D8130607C # 0.400229812776 504
+ .quad 0x03FD99D5D8130607C
+ .quad 0x03FD9A94E93E1EC37 # 0.400958675782 505
+ .quad 0x03FD9A94E93E1EC37
+ .quad 0x03FD9B244D87735E8 # 0.401505671875 506
+ .quad 0x03FD9B244D87735E8
+ .quad 0x03FD9BE39D2A97F0B # 0.402235465741 507
+ .quad 0x03FD9BE39D2A97F0B
+ .quad 0x03FD9CA3109266E23 # 0.402965792595 508
+ .quad 0x03FD9CA3109266E23
+ .quad 0x03FD9D32BEA15ED3A # 0.403513887977 509
+ .quad 0x03FD9D32BEA15ED3A
+ .quad 0x03FD9DF270C1914A8 # 0.404245149435 510
+ .quad 0x03FD9DF270C1914A8
+ .quad 0x03FD9E824DEA3E135 # 0.404793946669 511
+ .quad 0x03FD9E824DEA3E135
+ .quad 0x03FD9F423EEBF9DA1 # 0.405526145127 512
+ .quad 0x03FD9F423EEBF9DA1
+ .quad 0x03FD9FD24B4D47012 # 0.406075646011 513
+ .quad 0x03FD9FD24B4D47012
+ .quad 0x03FDA0927B59DA6E2 # 0.406808783874 514
+ .quad 0x03FDA0927B59DA6E2
+ .quad 0x03FDA152CF7F3B46D # 0.407542459622 515
+ .quad 0x03FDA152CF7F3B46D
+ .quad 0x03FDA1E32653B420E # 0.408093069896 516
+ .quad 0x03FDA1E32653B420E
+ .quad 0x03FDA2A3B9C527DB1 # 0.408827688845 517
+ .quad 0x03FDA2A3B9C527DB1
+ .quad 0x03FDA33440224FA79 # 0.409379007429 518
+ .quad 0x03FDA33440224FA79
+ .quad 0x03FDA3F513098DD09 # 0.410114572008 519
+ .quad 0x03FDA3F513098DD09
+ .quad 0x03FDA485C90EBDB0C # 0.410666600728 520
+ .quad 0x03FDA485C90EBDB0C
+ .quad 0x03FDA546DB95A721A # 0.411403113374 521
+ .quad 0x03FDA546DB95A721A
+ .quad 0x03FDA5D7C16257437 # 0.411955854060 522
+ .quad 0x03FDA5D7C16257437
+ .quad 0x03FDA69913B2F6572 # 0.412693317221 523
+ .quad 0x03FDA69913B2F6572
+ .quad 0x03FDA72A2966BE1EA # 0.413246771713 524
+ .quad 0x03FDA72A2966BE1EA
+ .quad 0x03FDA7EBBBAB46E8B # 0.413985187844 525
+ .quad 0x03FDA7EBBBAB46E8B
+ .quad 0x03FDA87D0165DD199 # 0.414539357989 526
+ .quad 0x03FDA87D0165DD199
+ .quad 0x03FDA93ED3C8AD9E3 # 0.415278729556 527
+ .quad 0x03FDA93ED3C8AD9E3
+ .quad 0x03FDA9D049A9E884A # 0.415833617206 528
+ .quad 0x03FDA9D049A9E884A
+ .quad 0x03FDAA925C5588EFA # 0.416573946686 529
+ .quad 0x03FDAA925C5588EFA
+ .quad 0x03FDAB24027D5E8AF # 0.417129553701 530
+ .quad 0x03FDAB24027D5E8AF
+ .quad 0x03FDABE6559C8167C # 0.417870843580 531
+ .quad 0x03FDABE6559C8167C
+ .quad 0x03FDAC782C2B07944 # 0.418427171828 532
+ .quad 0x03FDAC782C2B07944
+ .quad 0x03FDAD3ABFE88A06E # 0.419169424599 533
+ .quad 0x03FDAD3ABFE88A06E
+ .quad 0x03FDADCCC6FDF6A80 # 0.419726475955 534
+ .quad 0x03FDADCCC6FDF6A80
+ .quad 0x03FDAE5EE2E961227 # 0.420283837790 535
+ .quad 0x03FDAE5EE2E961227
+ .quad 0x03FDAF21D34189D0A # 0.421027470470 536
+ .quad 0x03FDAF21D34189D0A
+ .quad 0x03FDAFB41FE2167B4 # 0.421585558104 537
+ .quad 0x03FDAFB41FE2167B4
+ .quad 0x03FDB07751416A7F3 # 0.422330159776 538
+ .quad 0x03FDB07751416A7F3
+ .quad 0x03FDB109CEB79DB8A # 0.422888975102 539
+ .quad 0x03FDB109CEB79DB8A
+ .quad 0x03FDB1CD41498DF12 # 0.423634548296 540
+ .quad 0x03FDB1CD41498DF12
+ .quad 0x03FDB25FEFB60CB2E # 0.424194093214 541
+ .quad 0x03FDB25FEFB60CB2E
+ .quad 0x03FDB323A3A63594A # 0.424940640468 542
+ .quad 0x03FDB323A3A63594A
+ .quad 0x03FDB3B68329C59E9 # 0.425500916886 543
+ .quad 0x03FDB3B68329C59E9
+ .quad 0x03FDB44977C148F1A # 0.426061507389 544
+ .quad 0x03FDB44977C148F1A
+ .quad 0x03FDB50D895F7773A # 0.426809450580 545
+ .quad 0x03FDB50D895F7773A
+ .quad 0x03FDB5A0AF3D169CD # 0.427370775322 546
+ .quad 0x03FDB5A0AF3D169CD
+ .quad 0x03FDB66502A41E541 # 0.428119698779 547
+ .quad 0x03FDB66502A41E541
+ .quad 0x03FDB6F859E8EF639 # 0.428681759684 548
+ .quad 0x03FDB6F859E8EF639
+ .quad 0x03FDB78BC664238C0 # 0.429244136679 549
+ .quad 0x03FDB78BC664238C0
+ .quad 0x03FDB85078123E586 # 0.429994464983 550
+ .quad 0x03FDB85078123E586
+ .quad 0x03FDB8E41624226C5 # 0.430557580905 551
+ .quad 0x03FDB8E41624226C5
+ .quad 0x03FDB9A90A06BCB3D # 0.431308895742 552
+ .quad 0x03FDB9A90A06BCB3D
+ .quad 0x03FDBA3CD9D0B81BD # 0.431872752537 553
+ .quad 0x03FDBA3CD9D0B81BD
+ .quad 0x03FDBAD0BEF3DB164 # 0.432436927446 554
+ .quad 0x03FDBAD0BEF3DB164
+ .quad 0x03FDBB9611B80E2FC # 0.433189656123 555
+ .quad 0x03FDBB9611B80E2FC
+ .quad 0x03FDBC2A28C33B75D # 0.433754574696 556
+ .quad 0x03FDBC2A28C33B75D
+ .quad 0x03FDBCBE553C2BDDF # 0.434319812582 557
+ .quad 0x03FDBCBE553C2BDDF
+ .quad 0x03FDBD84073D8EC2B # 0.435073960430 558
+ .quad 0x03FDBD84073D8EC2B
+ .quad 0x03FDBE1865CEC1EC9 # 0.435639944787 559
+ .quad 0x03FDBE1865CEC1EC9
+ .quad 0x03FDBEACD9E271AD1 # 0.436206249662 560
+ .quad 0x03FDBEACD9E271AD1
+ .quad 0x03FDBF72EB7D20355 # 0.436961822044 561
+ .quad 0x03FDBF72EB7D20355
+ .quad 0x03FDC00791D99132B # 0.437528876213 562
+ .quad 0x03FDC00791D99132B
+ .quad 0x03FDC09C4DCD565AB # 0.438096252115 563
+ .quad 0x03FDC09C4DCD565AB
+ .quad 0x03FDC162BF5DF23E4 # 0.438853254422 564
+ .quad 0x03FDC162BF5DF23E4
+ .quad 0x03FDC1F7ADCB3DAB0 # 0.439421382456 565
+ .quad 0x03FDC1F7ADCB3DAB0
+ .quad 0x03FDC28CB1E4D32FD # 0.439989833442 566
+ .quad 0x03FDC28CB1E4D32FD
+ .quad 0x03FDC35383C8850B0 # 0.440748271097 567
+ .quad 0x03FDC35383C8850B0
+ .quad 0x03FDC3E8BA8CACF27 # 0.441317477070 568
+ .quad 0x03FDC3E8BA8CACF27
+ .quad 0x03FDC47E071233744 # 0.441887007223 569
+ .quad 0x03FDC47E071233744
+ .quad 0x03FDC54539A6ABCD2 # 0.442646885679 570
+ .quad 0x03FDC54539A6ABCD2
+ .quad 0x03FDC5DAB908186FF # 0.443217173690 571
+ .quad 0x03FDC5DAB908186FF
+ .quad 0x03FDC6704E4016FF7 # 0.443787787115 572
+ .quad 0x03FDC6704E4016FF7
+ .quad 0x03FDC737E1E38F4FB # 0.444549111857 573
+ .quad 0x03FDC737E1E38F4FB
+ .quad 0x03FDC7CDAA290FEAD # 0.445120486027 574
+ .quad 0x03FDC7CDAA290FEAD
+ .quad 0x03FDC863885A74D16 # 0.445692186852 575
+ .quad 0x03FDC863885A74D16
+ .quad 0x03FDC8F97C7E299DB # 0.446264214707 576
+ .quad 0x03FDC8F97C7E299DB
+ .quad 0x03FDC9C18EDC7C26B # 0.447027427871 577
+ .quad 0x03FDC9C18EDC7C26B
+ .quad 0x03FDCA57B64E9DB05 # 0.447600220249 578
+ .quad 0x03FDCA57B64E9DB05
+ .quad 0x03FDCAEDF3C88A364 # 0.448173340907 579
+ .quad 0x03FDCAEDF3C88A364
+ .quad 0x03FDCB844750B9995 # 0.448746790220 580
+ .quad 0x03FDCB844750B9995
+ .quad 0x03FDCC4CD90B3ECE5 # 0.449511901199 581
+ .quad 0x03FDCC4CD90B3ECE5
+ .quad 0x03FDCCE3602341C10 # 0.450086118843 582
+ .quad 0x03FDCCE3602341C10
+ .quad 0x03FDCD79FD5F2BC77 # 0.450660666403 583
+ .quad 0x03FDCD79FD5F2BC77
+ .quad 0x03FDCE10B0C581284 # 0.451235544257 584
+ .quad 0x03FDCE10B0C581284
+ .quad 0x03FDCED9C27EC6607 # 0.452002562511 585
+ .quad 0x03FDCED9C27EC6607
+ .quad 0x03FDCF70A9B6D3810 # 0.452578212532 586
+ .quad 0x03FDCF70A9B6D3810
+ .quad 0x03FDD007A72F19BBC # 0.453154194116 587
+ .quad 0x03FDD007A72F19BBC
+ .quad 0x03FDD09EBAEE29DD8 # 0.453730507647 588
+ .quad 0x03FDD09EBAEE29DD8
+ .quad 0x03FDD1684D49F46AE # 0.454499442710 589
+ .quad 0x03FDD1684D49F46AE
+ .quad 0x03FDD1FF951D1F1B3 # 0.455076532271 590
+ .quad 0x03FDD1FF951D1F1B3
+ .quad 0x03FDD296F34D0B65C # 0.455653955057 591
+ .quad 0x03FDD296F34D0B65C
+ .quad 0x03FDD32E67E056BD5 # 0.456231711452 592
+ .quad 0x03FDD32E67E056BD5
+ .quad 0x03FDD3C5F2DDA1840 # 0.456809801843 593
+ .quad 0x03FDD3C5F2DDA1840
+ .quad 0x03FDD490246DEFA6A # 0.457581109247 594
+ .quad 0x03FDD490246DEFA6A
+ .quad 0x03FDD527E3D1B95FC # 0.458159980465 595
+ .quad 0x03FDD527E3D1B95FC
+ .quad 0x03FDD5BFB9B5AE71F # 0.458739186968 596
+ .quad 0x03FDD5BFB9B5AE71F
+ .quad 0x03FDD657A6207C0DB # 0.459318729146 597
+ .quad 0x03FDD657A6207C0DB
+ .quad 0x03FDD6EFA918D25CE # 0.459898607388 598
+ .quad 0x03FDD6EFA918D25CE
+ .quad 0x03FDD7BA7AD9E7DA1 # 0.460672301817 599
+ .quad 0x03FDD7BA7AD9E7DA1
+ .quad 0x03FDD852B28BE5A0F # 0.461252965726 600
+ .quad 0x03FDD852B28BE5A0F
+ .quad 0x03FDD8EB00E1CCE14 # 0.461833967001 601
+ .quad 0x03FDD8EB00E1CCE14
+ .quad 0x03FDD98365E25ABB9 # 0.462415306035 602
+ .quad 0x03FDD98365E25ABB9
+ .quad 0x03FDDA1BE1944F538 # 0.462996983220 603
+ .quad 0x03FDDA1BE1944F538
+ .quad 0x03FDDAE75484C9615 # 0.463773079495 604
+ .quad 0x03FDDAE75484C9615
+ .quad 0x03FDDB8005445488B # 0.464355547233 605
+ .quad 0x03FDDB8005445488B
+ .quad 0x03FDDC18CCCBDCB83 # 0.464938354438 606
+ .quad 0x03FDDC18CCCBDCB83
+ .quad 0x03FDDCB1AB222F33D # 0.465521501504 607
+ .quad 0x03FDDCB1AB222F33D
+ .quad 0x03FDDD4AA04E1C4B7 # 0.466104988830 608
+ .quad 0x03FDDD4AA04E1C4B7
+ .quad 0x03FDDDE3AC56775D2 # 0.466688816812 609
+ .quad 0x03FDDDE3AC56775D2
+ .quad 0x03FDDE7CCF4216D6E # 0.467272985848 610
+ .quad 0x03FDDE7CCF4216D6E
+ .quad 0x03FDDF492177D7BBC # 0.468052409114 611
+ .quad 0x03FDDF492177D7BBC
+ .quad 0x03FDDFE279E5BF4EE # 0.468637375496 612
+ .quad 0x03FDDFE279E5BF4EE
+ .quad 0x03FDE07BE94DCC439 # 0.469222684263 613
+ .quad 0x03FDE07BE94DCC439
+ .quad 0x03FDE1156FB6E2626 # 0.469808335817 614
+ .quad 0x03FDE1156FB6E2626
+ .quad 0x03FDE1AF0D27E88D7 # 0.470394330560 615
+ .quad 0x03FDE1AF0D27E88D7
+ .quad 0x03FDE248C1A7C8C26 # 0.470980668894 616
+ .quad 0x03FDE248C1A7C8C26
+ .quad 0x03FDE2E28D3D701CC # 0.471567351222 617
+ .quad 0x03FDE2E28D3D701CC
+ .quad 0x03FDE37C6FEFCED73 # 0.472154377948 618
+ .quad 0x03FDE37C6FEFCED73
+ .quad 0x03FDE449C232C39D8 # 0.472937616681 619
+ .quad 0x03FDE449C232C39D8
+ .quad 0x03FDE4E3DAEDDB5F6 # 0.473525448578 620
+ .quad 0x03FDE4E3DAEDDB5F6
+ .quad 0x03FDE57E0ADCE1EA5 # 0.474113626224 621
+ .quad 0x03FDE57E0ADCE1EA5
+ .quad 0x03FDE6185206D516F # 0.474702150027 622
+ .quad 0x03FDE6185206D516F
+ .quad 0x03FDE6B2B072B5E6F # 0.475291020395 623
+ .quad 0x03FDE6B2B072B5E6F
+ .quad 0x03FDE74D26278887A # 0.475880237735 624
+ .quad 0x03FDE74D26278887A
+ .quad 0x03FDE7E7B32C5453F # 0.476469802457 625
+ .quad 0x03FDE7E7B32C5453F
+ .quad 0x03FDE882578823D52 # 0.477059714970 626
+ .quad 0x03FDE882578823D52
+ .quad 0x03FDE91D134204C67 # 0.477649975686 627
+ .quad 0x03FDE91D134204C67
+ .quad 0x03FDE9B7E6610815A # 0.478240585015 628
+ .quad 0x03FDE9B7E6610815A
+ .quad 0x03FDEA52D0EC41E5E # 0.478831543369 629
+ .quad 0x03FDEA52D0EC41E5E
+ .quad 0x03FDEB218376ECFC0 # 0.479620031484 630
+ .quad 0x03FDEB218376ECFC0
+ .quad 0x03FDEBBCA4C4E9E87 # 0.480211805838 631
+ .quad 0x03FDEBBCA4C4E9E87
+ .quad 0x03FDEC57DD96CD0CB # 0.480803930597 632
+ .quad 0x03FDEC57DD96CD0CB
+ .quad 0x03FDECF32DF3B887D # 0.481396406174 633
+ .quad 0x03FDECF32DF3B887D
+ .quad 0x03FDED8E95E2D1B88 # 0.481989232987 634
+ .quad 0x03FDED8E95E2D1B88
+ .quad 0x03FDEE2A156B413E5 # 0.482582411453 635
+ .quad 0x03FDEE2A156B413E5
+ .quad 0x03FDEEC5AC9432FCB # 0.483175941987 636
+ .quad 0x03FDEEC5AC9432FCB
+ .quad 0x03FDEF615B64D61C7 # 0.483769825010 637
+ .quad 0x03FDEF615B64D61C7
+ .quad 0x03FDEFFD21E45D0D1 # 0.484364060939 638
+ .quad 0x03FDEFFD21E45D0D1
+ .quad 0x03FDF0990019FD887 # 0.484958650194 639
+ .quad 0x03FDF0990019FD887
+ .quad 0x03FDF134F60CF092D # 0.485553593197 640
+ .quad 0x03FDF134F60CF092D
+ .quad 0x03FDF1D103C4727E4 # 0.486148890367 641
+ .quad 0x03FDF1D103C4727E4
+ .quad 0x03FDF26D2947C2EC5 # 0.486744542127 642
+ .quad 0x03FDF26D2947C2EC5
+ .quad 0x03FDF309669E24CF9 # 0.487340548899 643
+ .quad 0x03FDF309669E24CF9
+ .quad 0x03FDF3A5BBCEDE6E1 # 0.487936911107 644
+ .quad 0x03FDF3A5BBCEDE6E1
+ .quad 0x03FDF44228E13963A # 0.488533629176 645
+ .quad 0x03FDF44228E13963A
+ .quad 0x03FDF4DEADDC82A35 # 0.489130703529 646
+ .quad 0x03FDF4DEADDC82A35
+ .quad 0x03FDF57B4AC80A79A # 0.489728134594 647
+ .quad 0x03FDF57B4AC80A79A
+ .quad 0x03FDF617FFAB248ED # 0.490325922795 648
+ .quad 0x03FDF617FFAB248ED
+ .quad 0x03FDF6B4CC8D27E87 # 0.490924068561 649
+ .quad 0x03FDF6B4CC8D27E87
+ .quad 0x03FDF751B1756EEC8 # 0.491522572320 650
+ .quad 0x03FDF751B1756EEC8
+ .quad 0x03FDF7EEAE6B5761C # 0.492121434499 651
+ .quad 0x03FDF7EEAE6B5761C
+ .quad 0x03FDF88BC3764273B # 0.492720655530 652
+ .quad 0x03FDF88BC3764273B
+ .quad 0x03FDF928F09D94B32 # 0.493320235842 653
+ .quad 0x03FDF928F09D94B32
+ .quad 0x03FDF9C635E8B6192 # 0.493920175866 654
+ .quad 0x03FDF9C635E8B6192
+ .quad 0x03FDFA63935F1208C # 0.494520476034 655
+ .quad 0x03FDFA63935F1208C
+ .quad 0x03FDFB0109081751A # 0.495121136779 656
+ .quad 0x03FDFB0109081751A
+ .quad 0x03FDFB9E96EB38311 # 0.495722158534 657
+ .quad 0x03FDFB9E96EB38311
+ .quad 0x03FDFC3C3D0FEA555 # 0.496323541733 658
+ .quad 0x03FDFC3C3D0FEA555
+ .quad 0x03FDFCD9FB7DA6DEF # 0.496925286812 659
+ .quad 0x03FDFCD9FB7DA6DEF
+ .quad 0x03FDFD77D23BEA634 # 0.497527394206 660
+ .quad 0x03FDFD77D23BEA634
+ .quad 0x03FDFE15C15234EE2 # 0.498129864352 661
+ .quad 0x03FDFE15C15234EE2
+ .quad 0x03FDFEB3C8C80A04E # 0.498732697687 662
+ .quad 0x03FDFEB3C8C80A04E
+ .quad 0x03FDFF51E8A4F0A74 # 0.499335894649 663
+ .quad 0x03FDFF51E8A4F0A74
+ .quad 0x03FDFFF020F07352E # 0.499939455677 664
+ .quad 0x03FDFFF020F07352E
+ .quad 0x03FE004738D910023 # 0.500543381211 665
+ .quad 0x03FE004738D910023
+ .quad 0x03FE00966D78C41CF # 0.501147671692 666
+ .quad 0x03FE00966D78C41CF
+ .quad 0x03FE00E5AE5B207AB # 0.501752327560 667
+ .quad 0x03FE00E5AE5B207AB
+ .quad 0x03FE011A8B18F0ED6 # 0.502155634684 668
+ .quad 0x03FE011A8B18F0ED6
+ .quad 0x03FE0169E072D7311 # 0.502760900515 669
+ .quad 0x03FE0169E072D7311
+ .quad 0x03FE01B942198A5A1 # 0.503366532915 670
+ .quad 0x03FE01B942198A5A1
+ .quad 0x03FE0208B010DB642 # 0.503972532327 671
+ .quad 0x03FE0208B010DB642
+ .quad 0x03FE02582A5C9D122 # 0.504578899198 672
+ .quad 0x03FE02582A5C9D122
+ .quad 0x03FE02A7B100A3EF0 # 0.505185633972 673
+ .quad 0x03FE02A7B100A3EF0
+ .quad 0x03FE02F74400C64EA # 0.505792737097 674
+ .quad 0x03FE02F74400C64EA
+ .quad 0x03FE0346E360DC4F9 # 0.506400209020 675
+ .quad 0x03FE0346E360DC4F9
+ .quad 0x03FE03968F24BFDB6 # 0.507008050190 676
+ .quad 0x03FE03968F24BFDB6
+ .quad 0x03FE03E647504CA89 # 0.507616261055 677
+ .quad 0x03FE03E647504CA89
+ .quad 0x03FE04360BE7603AE # 0.508224842066 678
+ .quad 0x03FE04360BE7603AE
+ .quad 0x03FE046B4089BE0FD # 0.508630768599 679
+ .quad 0x03FE046B4089BE0FD
+ .quad 0x03FE04BB19DCA36B3 # 0.509239967521 680
+ .quad 0x03FE04BB19DCA36B3
+ .quad 0x03FE050AFFA5671A5 # 0.509849537793 681
+ .quad 0x03FE050AFFA5671A5
+ .quad 0x03FE055AF1E7ED47B # 0.510459479867 682
+ .quad 0x03FE055AF1E7ED47B
+ .quad 0x03FE05AAF0A81BF04 # 0.511069794198 683
+ .quad 0x03FE05AAF0A81BF04
+ .quad 0x03FE05FAFBE9DAE58 # 0.511680481240 684
+ .quad 0x03FE05FAFBE9DAE58
+ .quad 0x03FE064B13B113CDD # 0.512291541448 685
+ .quad 0x03FE064B13B113CDD
+ .quad 0x03FE069B3801B2263 # 0.512902975280 686
+ .quad 0x03FE069B3801B2263
+ .quad 0x03FE06D0AC85B63A2 # 0.513310805628 687
+ .quad 0x03FE06D0AC85B63A2
+ .quad 0x03FE0720E5C40DF1D # 0.513922863181 688
+ .quad 0x03FE0720E5C40DF1D
+ .quad 0x03FE07712B9648153 # 0.514535295577 689
+ .quad 0x03FE07712B9648153
+ .quad 0x03FE07C17E0056E7C # 0.515148103277 690
+ .quad 0x03FE07C17E0056E7C
+ .quad 0x03FE0811DD062E889 # 0.515761286740 691
+ .quad 0x03FE0811DD062E889
+ .quad 0x03FE086248ABC4F3B # 0.516374846428 692
+ .quad 0x03FE086248ABC4F3B
+ .quad 0x03FE08B2C0F512033 # 0.516988782802 693
+ .quad 0x03FE08B2C0F512033
+ .quad 0x03FE08E86D82DA3EE # 0.517398283218 694
+ .quad 0x03FE08E86D82DA3EE
+ .quad 0x03FE0938FAE5D8E9B # 0.518012848432 695
+ .quad 0x03FE0938FAE5D8E9B
+ .quad 0x03FE098994F72C539 # 0.518627791569 696
+ .quad 0x03FE098994F72C539
+ .quad 0x03FE09DA3BBAD339C # 0.519243113094 697
+ .quad 0x03FE09DA3BBAD339C
+ .quad 0x03FE0A2AEF34CE3D1 # 0.519858813473 698
+ .quad 0x03FE0A2AEF34CE3D1
+ .quad 0x03FE0A7BAF691FE34 # 0.520474893172 699
+ .quad 0x03FE0A7BAF691FE34
+ .quad 0x03FE0AB18BF5823C3 # 0.520885823936 700
+ .quad 0x03FE0AB18BF5823C3
+ .quad 0x03FE0B02616952989 # 0.521502536876 701
+ .quad 0x03FE0B02616952989
+ .quad 0x03FE0B5343A234476 # 0.522119630385 702
+ .quad 0x03FE0B5343A234476
+ .quad 0x03FE0BA432A430CA2 # 0.522737104934 703
+ .quad 0x03FE0BA432A430CA2
+ .quad 0x03FE0BF52E73538CE # 0.523354960993 704
+ .quad 0x03FE0BF52E73538CE
+ .quad 0x03FE0C463713A9E6F # 0.523973199034 705
+ .quad 0x03FE0C463713A9E6F
+ .quad 0x03FE0C7C43F4C861E # 0.524385570174 706
+ .quad 0x03FE0C7C43F4C861E
+ .quad 0x03FE0CCD61FAD07D2 # 0.525004445903 707
+ .quad 0x03FE0CCD61FAD07D2
+ .quad 0x03FE0D1E8CDCE3DB6 # 0.525623704876 708
+ .quad 0x03FE0D1E8CDCE3DB6
+ .quad 0x03FE0D6FC49F16E93 # 0.526243347569 709
+ .quad 0x03FE0D6FC49F16E93
+ .quad 0x03FE0DC109458004A # 0.526863374456 710
+ .quad 0x03FE0DC109458004A
+ .quad 0x03FE0DF73E353F0ED # 0.527276939392 711
+ .quad 0x03FE0DF73E353F0ED
+ .quad 0x03FE0E4898611CCE1 # 0.527897607665 712
+ .quad 0x03FE0E4898611CCE1
+ .quad 0x03FE0E99FF7C20738 # 0.528518661406 713
+ .quad 0x03FE0E99FF7C20738
+ .quad 0x03FE0EEB738A67874 # 0.529140101094 714
+ .quad 0x03FE0EEB738A67874
+ .quad 0x03FE0F21C81D1ADC3 # 0.529554608872 715
+ .quad 0x03FE0F21C81D1ADC3
+ .quad 0x03FE0F7351C9FCD7F # 0.530176692874 716
+ .quad 0x03FE0F7351C9FCD7F
+ .quad 0x03FE0FC4E875254C1 # 0.530799164104 717
+ .quad 0x03FE0FC4E875254C1
+ .quad 0x03FE10168C22B8FB9 # 0.531422023047 718
+ .quad 0x03FE10168C22B8FB9
+ .quad 0x03FE10683CD6DEA54 # 0.532045270185 719
+ .quad 0x03FE10683CD6DEA54
+ .quad 0x03FE109EB9E2E4C97 # 0.532460984179 720
+ .quad 0x03FE109EB9E2E4C97
+ .quad 0x03FE10F08055E7785 # 0.533084879385 721
+ .quad 0x03FE10F08055E7785
+ .quad 0x03FE114253DA97DA0 # 0.533709164079 722
+ .quad 0x03FE114253DA97DA0
+ .quad 0x03FE1194347523FDC # 0.534333838748 723
+ .quad 0x03FE1194347523FDC
+ .quad 0x03FE11CAD1789B0F8 # 0.534750505421 724
+ .quad 0x03FE11CAD1789B0F8
+ .quad 0x03FE121CC7EB8F7E6 # 0.535375831132 725
+ .quad 0x03FE121CC7EB8F7E6
+ .quad 0x03FE126ECB7F8F007 # 0.536001548120 726
+ .quad 0x03FE126ECB7F8F007
+ .quad 0x03FE12A57FDA37091 # 0.536418910396 727
+ .quad 0x03FE12A57FDA37091
+ .quad 0x03FE12F799594EFBC # 0.537045280601 728
+ .quad 0x03FE12F799594EFBC
+ .quad 0x03FE1349C004AFB00 # 0.537672043392 729
+ .quad 0x03FE1349C004AFB00
+ .quad 0x03FE139BF3E094003 # 0.538299199261 730
+ .quad 0x03FE139BF3E094003
+ .quad 0x03FE13D2C873C5E13 # 0.538717521794 731
+ .quad 0x03FE13D2C873C5E13
+ .quad 0x03FE142512549C16C # 0.539345333889 732
+ .quad 0x03FE142512549C16C
+ .quad 0x03FE14776971477F1 # 0.539973540381 733
+ .quad 0x03FE14776971477F1
+ .quad 0x03FE14C9CDCE0A74D # 0.540602141763 734
+ .quad 0x03FE14C9CDCE0A74D
+ .quad 0x03FE1500C2BFD1561 # 0.541021428981 735
+ .quad 0x03FE1500C2BFD1561
+ .quad 0x03FE15533D3B8D7B3 # 0.541650689621 736
+ .quad 0x03FE15533D3B8D7B3
+ .quad 0x03FE15A5C502C6DC5 # 0.542280346478 737
+ .quad 0x03FE15A5C502C6DC5
+ .quad 0x03FE15DCD1973457B # 0.542700338085 738
+ .quad 0x03FE15DCD1973457B
+ .quad 0x03FE162F6F9071F76 # 0.543330656416 739
+ .quad 0x03FE162F6F9071F76
+ .quad 0x03FE16821AE0A13C6 # 0.543961372300 740
+ .quad 0x03FE16821AE0A13C6
+ .quad 0x03FE16B93F2C12808 # 0.544382070665 741
+ .quad 0x03FE16B93F2C12808
+ .quad 0x03FE170C00C169B51 # 0.545013450251 742
+ .quad 0x03FE170C00C169B51
+ .quad 0x03FE175ECFB935CC6 # 0.545645228728 743
+ .quad 0x03FE175ECFB935CC6
+ .quad 0x03FE17B1AC17CBD5B # 0.546277406602 744
+ .quad 0x03FE17B1AC17CBD5B
+ .quad 0x03FE17E8F12052E8A # 0.546699080654 745
+ .quad 0x03FE17E8F12052E8A
+ .quad 0x03FE183BE3DE8A7AF # 0.547331925312 746
+ .quad 0x03FE183BE3DE8A7AF
+ .quad 0x03FE188EE40F23CA7 # 0.547965170715 747
+ .quad 0x03FE188EE40F23CA7
+ .quad 0x03FE18C640FF75F06 # 0.548387557205 748
+ .quad 0x03FE18C640FF75F06
+ .quad 0x03FE191957A30FA51 # 0.549021471648 749
+ .quad 0x03FE191957A30FA51
+ .quad 0x03FE196C7BC4B1F3A # 0.549655788193 750
+ .quad 0x03FE196C7BC4B1F3A
+ .quad 0x03FE19A3F0B1860BD # 0.550078889532 751
+ .quad 0x03FE19A3F0B1860BD
+ .quad 0x03FE19F72B59A0CEC # 0.550713877383 752
+ .quad 0x03FE19F72B59A0CEC
+ .quad 0x03FE1A4A738B7A33C # 0.551349268700 753
+ .quad 0x03FE1A4A738B7A33C
+ .quad 0x03FE1A820089A2156 # 0.551773087312 754
+ .quad 0x03FE1A820089A2156
+ .quad 0x03FE1AD55F55855C8 # 0.552409152212 755
+ .quad 0x03FE1AD55F55855C8
+ .quad 0x03FE1B28CBB6EC93E # 0.553045621948 756
+ .quad 0x03FE1B28CBB6EC93E
+ .quad 0x03FE1B6070DB553D8 # 0.553470160269 757
+ .quad 0x03FE1B6070DB553D8
+ .quad 0x03FE1BB3F3EA714F6 # 0.554107305878 758
+ .quad 0x03FE1BB3F3EA714F6
+ .quad 0x03FE1BEBA8316EF2C # 0.554532295260 759
+ .quad 0x03FE1BEBA8316EF2C
+ .quad 0x03FE1C3F41FA97C6B # 0.555170118179 760
+ .quad 0x03FE1C3F41FA97C6B
+ .quad 0x03FE1C92E96C86020 # 0.555808348176 761
+ .quad 0x03FE1C92E96C86020
+ .quad 0x03FE1CCAB5FBFFEE1 # 0.556234061252 762
+ .quad 0x03FE1CCAB5FBFFEE1
+ .quad 0x03FE1D1E743BCFC47 # 0.556872970868 763
+ .quad 0x03FE1D1E743BCFC47
+ .quad 0x03FE1D72403052E75 # 0.557512288951 764
+ .quad 0x03FE1D72403052E75
+ .quad 0x03FE1DAA251D7E433 # 0.557938728190 765
+ .quad 0x03FE1DAA251D7E433
+ .quad 0x03FE1DFE07F3D1DAB # 0.558578728212 766
+ .quad 0x03FE1DFE07F3D1DAB
+ .quad 0x03FE1E35FC265D75E # 0.559005622562 767
+ .quad 0x03FE1E35FC265D75E
+ .quad 0x03FE1E89F5EB04126 # 0.559646305979 768
+ .quad 0x03FE1E89F5EB04126
+ .quad 0x03FE1EDDFD77E1FEF # 0.560287400135 769
+ .quad 0x03FE1EDDFD77E1FEF
+ .quad 0x03FE1F160A2AD0DA3 # 0.560715024687 770
+ .quad 0x03FE1F160A2AD0DA3
+ .quad 0x03FE1F6A28BA1B476 # 0.561356804579 771
+ .quad 0x03FE1F6A28BA1B476
+ .quad 0x03FE1FBE551DB43C1 # 0.561998996616 772
+ .quad 0x03FE1FBE551DB43C1
+ .quad 0x03FE1FF67A6684F47 # 0.562427353873 773
+ .quad 0x03FE1FF67A6684F47
+ .quad 0x03FE204ABDE0BE5DF # 0.563070233998 774
+ .quad 0x03FE204ABDE0BE5DF
+ .quad 0x03FE2082F29233211 # 0.563499050471 775
+ .quad 0x03FE2082F29233211
+ .quad 0x03FE20D74D2FBAFE4 # 0.564142620160 776
+ .quad 0x03FE20D74D2FBAFE4
+ .quad 0x03FE210F91524B469 # 0.564571896835 777
+ .quad 0x03FE210F91524B469
+ .quad 0x03FE2164031FDA0B0 # 0.565216157568 778
+ .quad 0x03FE2164031FDA0B0
+ .quad 0x03FE21B882DD26040 # 0.565860833641 779
+ .quad 0x03FE21B882DD26040
+ .quad 0x03FE21F0DFC65CEEC # 0.566290848698 780
+ .quad 0x03FE21F0DFC65CEEC
+ .quad 0x03FE224576C81FFE0 # 0.566936218194 781
+ .quad 0x03FE224576C81FFE0
+ .quad 0x03FE227DE33896A44 # 0.567366696031 782
+ .quad 0x03FE227DE33896A44
+ .quad 0x03FE22D2918BA4A31 # 0.568012760445 783
+ .quad 0x03FE22D2918BA4A31
+ .quad 0x03FE23274DE272A83 # 0.568659242528 784
+ .quad 0x03FE23274DE272A83
+ .quad 0x03FE235FD33D232FC # 0.569090462888 785
+ .quad 0x03FE235FD33D232FC
+ .quad 0x03FE23B4A6F9D8688 # 0.569737642287 786
+ .quad 0x03FE23B4A6F9D8688
+ .quad 0x03FE23ED3BF21CA33 # 0.570169328026 787
+ .quad 0x03FE23ED3BF21CA33
+ .quad 0x03FE24422721A89D7 # 0.570817206248 788
+ .quad 0x03FE24422721A89D7
+ .quad 0x03FE247ACBC023D2B # 0.571249358372 789
+ .quad 0x03FE247ACBC023D2B
+ .quad 0x03FE24CFCE6F80D9B # 0.571897936927 790
+ .quad 0x03FE24CFCE6F80D9B
+ .quad 0x03FE250882BCDD7D8 # 0.572330556445 791
+ .quad 0x03FE250882BCDD7D8
+ .quad 0x03FE255D9CF910A56 # 0.572979836849 792
+ .quad 0x03FE255D9CF910A56
+ .quad 0x03FE25B2C55CD5762 # 0.573629539091 793
+ .quad 0x03FE25B2C55CD5762
+ .quad 0x03FE25EB92D41992D # 0.574062908546 794
+ .quad 0x03FE25EB92D41992D
+ .quad 0x03FE2640D2D99FFEA # 0.574713315073 795
+ .quad 0x03FE2640D2D99FFEA
+ .quad 0x03FE2679B0166F51C # 0.575147154559 796
+ .quad 0x03FE2679B0166F51C
+ .quad 0x03FE26CF07CAD8B00 # 0.575798266899 797
+ .quad 0x03FE26CF07CAD8B00
+ .quad 0x03FE2707F4D5F7C40 # 0.576232577438 798
+ .quad 0x03FE2707F4D5F7C40
+ .quad 0x03FE275D644670606 # 0.576884397124 799
+ .quad 0x03FE275D644670606
+ .quad 0x03FE27966128AB11B # 0.577319179739 800
+ .quad 0x03FE27966128AB11B
+ .quad 0x03FE27EBE8626A387 # 0.577971708311 801
+ .quad 0x03FE27EBE8626A387
+ .quad 0x03FE2824F52493BD2 # 0.578406964030 802
+ .quad 0x03FE2824F52493BD2
+ .quad 0x03FE287A9434DBC7B # 0.579060203030 803
+ .quad 0x03FE287A9434DBC7B
+ .quad 0x03FE28B3B0DFCEB80 # 0.579495932884 804
+ .quad 0x03FE28B3B0DFCEB80
+ .quad 0x03FE290967D3ED18D # 0.580149883861 805
+ .quad 0x03FE290967D3ED18D
+ .quad 0x03FE294294708B773 # 0.580586088885 806
+ .quad 0x03FE294294708B773
+ .quad 0x03FE29986355D8C69 # 0.581240753393 807
+ .quad 0x03FE29986355D8C69
+ .quad 0x03FE29D19FED0C082 # 0.581677434622 808
+ .quad 0x03FE29D19FED0C082
+ .quad 0x03FE2A2786D0EC107 # 0.582332814220 809
+ .quad 0x03FE2A2786D0EC107
+ .quad 0x03FE2A60D36BA5253 # 0.582769972697 810
+ .quad 0x03FE2A60D36BA5253
+ .quad 0x03FE2AB6D25B86EF7 # 0.583426068948 811
+ .quad 0x03FE2AB6D25B86EF7
+ .quad 0x03FE2AF02F02BE4AB # 0.583863705716 812
+ .quad 0x03FE2AF02F02BE4AB
+ .quad 0x03FE2B46460C1C2B3 # 0.584520520190 813
+ .quad 0x03FE2B46460C1C2B3
+ .quad 0x03FE2B7FB2C8D1CC1 # 0.584958636297 814
+ .quad 0x03FE2B7FB2C8D1CC1
+ .quad 0x03FE2BD5E1F9316F2 # 0.585616170568 815
+ .quad 0x03FE2BD5E1F9316F2
+ .quad 0x03FE2C0F5ED46CE8D # 0.586054767066 816
+ .quad 0x03FE2C0F5ED46CE8D
+ .quad 0x03FE2C65A6395F5F5 # 0.586713022712 817
+ .quad 0x03FE2C65A6395F5F5
+ .quad 0x03FE2C9F333C2FE1E # 0.587152100656 818
+ .quad 0x03FE2C9F333C2FE1E
+ .quad 0x03FE2CF592E351AE5 # 0.587811079263 819
+ .quad 0x03FE2CF592E351AE5
+ .quad 0x03FE2D2F3016CE0EF # 0.588250639709 820
+ .quad 0x03FE2D2F3016CE0EF
+ .quad 0x03FE2D85A80DC7324 # 0.588910342867 821
+ .quad 0x03FE2D85A80DC7324
+ .quad 0x03FE2DBF557B0DF43 # 0.589350386878 822
+ .quad 0x03FE2DBF557B0DF43
+ .quad 0x03FE2E15E5CF91FA7 # 0.590010816181 823
+ .quad 0x03FE2E15E5CF91FA7
+ .quad 0x03FE2E4FA37FC9577 # 0.590451344823 824
+ .quad 0x03FE2E4FA37FC9577
+ .quad 0x03FE2E8967B3BF4E1 # 0.590892067615 825
+ .quad 0x03FE2E8967B3BF4E1
+ .quad 0x03FE2EE01A3BED567 # 0.591553516212 826
+ .quad 0x03FE2EE01A3BED567
+ .quad 0x03FE2F19EEBFB00BA # 0.591994725131 827
+ .quad 0x03FE2F19EEBFB00BA
+ .quad 0x03FE2F70B9C67A7C2 # 0.592656903723 828
+ .quad 0x03FE2F70B9C67A7C2
+ .quad 0x03FE2FAA9EA342D04 # 0.593098599843 829
+ .quad 0x03FE2FAA9EA342D04
+ .quad 0x03FE3001823684D73 # 0.593761510043 830
+ .quad 0x03FE3001823684D73
+ .quad 0x03FE303B7775937EF # 0.594203694441 831
+ .quad 0x03FE303B7775937EF
+ .quad 0x03FE309273A3340FC # 0.594867337868 832
+ .quad 0x03FE309273A3340FC
+ .quad 0x03FE30CC794DD19D0 # 0.595310011625 833
+ .quad 0x03FE30CC794DD19D0
+ .quad 0x03FE3106858C76BB7 # 0.595752881428 834
+ .quad 0x03FE3106858C76BB7
+ .quad 0x03FE315DA4434068B # 0.596417554101 835
+ .quad 0x03FE315DA4434068B
+ .quad 0x03FE3197C0FA80E6A # 0.596860914783 836
+ .quad 0x03FE3197C0FA80E6A
+ .quad 0x03FE31EEF86D36EF1 # 0.597526324589 837
+ .quad 0x03FE31EEF86D36EF1
+ .quad 0x03FE322925A66E62D # 0.597970177237 838
+ .quad 0x03FE322925A66E62D
+ .quad 0x03FE328075E32022F # 0.598636325813 839
+ .quad 0x03FE328075E32022F
+ .quad 0x03FE32BAB3A7B21E9 # 0.599080671521 840
+ .quad 0x03FE32BAB3A7B21E9
+ .quad 0x03FE32F4F80D0B1BD # 0.599525214760 841
+ .quad 0x03FE32F4F80D0B1BD
+ .quad 0x03FE334C6B15D30DD # 0.600192400374 842
+ .quad 0x03FE334C6B15D30DD
+ .quad 0x03FE3386C013B90D6 # 0.600637438209 843
+ .quad 0x03FE3386C013B90D6
+ .quad 0x03FE33DE4C086C40A # 0.601305366543 844
+ .quad 0x03FE33DE4C086C40A
+ .quad 0x03FE3418B1A85622C # 0.601750900077 845
+ .quad 0x03FE3418B1A85622C
+ .quad 0x03FE34531DF21CFE3 # 0.602196632199 846
+ .quad 0x03FE34531DF21CFE3
+ .quad 0x03FE34AACCE299BA5 # 0.602865603124 847
+ .quad 0x03FE34AACCE299BA5
+ .quad 0x03FE34E549DBB21EF # 0.603311832493 848
+ .quad 0x03FE34E549DBB21EF
+ .quad 0x03FE353D11DA4F855 # 0.603981550121 849
+ .quad 0x03FE353D11DA4F855
+ .quad 0x03FE35779F8C43D6D # 0.604428277847 850
+ .quad 0x03FE35779F8C43D6D
+ .quad 0x03FE35B233F13DD4A # 0.604875205229 851
+ .quad 0x03FE35B233F13DD4A
+ .quad 0x03FE360A1F1BBA738 # 0.605545971045 852
+ .quad 0x03FE360A1F1BBA738
+ .quad 0x03FE3644C446F97BC # 0.605993398346 853
+ .quad 0x03FE3644C446F97BC
+ .quad 0x03FE367F702A9EA94 # 0.606441025927 854
+ .quad 0x03FE367F702A9EA94
+ .quad 0x03FE36D77E9D34FD7 # 0.607112843218 855
+ .quad 0x03FE36D77E9D34FD7
+ .quad 0x03FE37123B54987B7 # 0.607560972287 856
+ .quad 0x03FE37123B54987B7
+ .quad 0x03FE376A630C0A1D6 # 0.608233542652 857
+ .quad 0x03FE376A630C0A1D6
+ .quad 0x03FE37A530A0D5A31 # 0.608682174333 858
+ .quad 0x03FE37A530A0D5A31
+ .quad 0x03FE37E004F74E13B # 0.609131007374 859
+ .quad 0x03FE37E004F74E13B
+ .quad 0x03FE383850278CFD9 # 0.609804634884 860
+ .quad 0x03FE383850278CFD9
+ .quad 0x03FE3873356902AB7 # 0.610253972119 861
+ .quad 0x03FE3873356902AB7
+ .quad 0x03FE38AE2171976E8 # 0.610703511349 862
+ .quad 0x03FE38AE2171976E8
+ .quad 0x03FE390690373AFFF # 0.611378199331 863
+ .quad 0x03FE390690373AFFF
+ .quad 0x03FE39418D3872A53 # 0.611828244343 864
+ .quad 0x03FE39418D3872A53
+ .quad 0x03FE397C91064221F # 0.612278491987 865
+ .quad 0x03FE397C91064221F
+ .quad 0x03FE39D5237E045A5 # 0.612954243787 866
+ .quad 0x03FE39D5237E045A5
+ .quad 0x03FE3A1038522CE82 # 0.613404998809 867
+ .quad 0x03FE3A1038522CE82
+ .quad 0x03FE3A68E45AD354B # 0.614081512534 868
+ .quad 0x03FE3A68E45AD354B
+ .quad 0x03FE3AA40A3F2A68B # 0.614532776080 869
+ .quad 0x03FE3AA40A3F2A68B
+ .quad 0x03FE3ADF36F98A182 # 0.614984243356 870
+ .quad 0x03FE3ADF36F98A182
+ .quad 0x03FE3B3806E5DF340 # 0.615661826668 871
+ .quad 0x03FE3B3806E5DF340
+ .quad 0x03FE3B7344BE40311 # 0.616113804077 872
+ .quad 0x03FE3B7344BE40311
+ .quad 0x03FE3BAE897234A87 # 0.616565985862 873
+ .quad 0x03FE3BAE897234A87
+ .quad 0x03FE3C077D5F51881 # 0.617244642149 874
+ .quad 0x03FE3C077D5F51881
+ .quad 0x03FE3C42D33F2AE7B # 0.617697335683 875
+ .quad 0x03FE3C42D33F2AE7B
+ .quad 0x03FE3C7E30002960C # 0.618150234241 876
+ .quad 0x03FE3C7E30002960C
+ .quad 0x03FE3CD7480B4A8A3 # 0.618829966906 877
+ .quad 0x03FE3CD7480B4A8A3
+ .quad 0x03FE3D12B60622748 # 0.619283378838 878
+ .quad 0x03FE3D12B60622748
+ .quad 0x03FE3D4E2AE7B7E2B # 0.619736996447 879
+ .quad 0x03FE3D4E2AE7B7E2B
+ .quad 0x03FE3D89A6B1A558D # 0.620190819917 880
+ .quad 0x03FE3D89A6B1A558D
+ .quad 0x03FE3DE2ED57B1F9B # 0.620871941524 881
+ .quad 0x03FE3DE2ED57B1F9B
+ .quad 0x03FE3E1E7A6D8330E # 0.621326280468 882
+ .quad 0x03FE3E1E7A6D8330E
+ .quad 0x03FE3E5A0E714DA6E # 0.621780825931 883
+ .quad 0x03FE3E5A0E714DA6E
+ .quad 0x03FE3EB37978B85B6 # 0.622463031756 884
+ .quad 0x03FE3EB37978B85B6
+ .quad 0x03FE3EEF1ED68236B # 0.622918094335 885
+ .quad 0x03FE3EEF1ED68236B
+ .quad 0x03FE3F2ACB27ED6C7 # 0.623373364090 886
+ .quad 0x03FE3F2ACB27ED6C7
+ .quad 0x03FE3F845AAE68C81 # 0.624056657591 887
+ .quad 0x03FE3F845AAE68C81
+ .quad 0x03FE3FC0186800514 # 0.624512446113 888
+ .quad 0x03FE3FC0186800514
+ .quad 0x03FE3FFBDD1AE8406 # 0.624968442473 889
+ .quad 0x03FE3FFBDD1AE8406
+ .quad 0x03FE4037A8C8C197A # 0.625424646860 890
+ .quad 0x03FE4037A8C8C197A
+ .quad 0x03FE409167679DD99 # 0.626109343909 891
+ .quad 0x03FE409167679DD99
+ .quad 0x03FE40CD448FF6DD6 # 0.626566069196 892
+ .quad 0x03FE40CD448FF6DD6
+ .quad 0x03FE410928B8F950F # 0.627023003177 893
+ .quad 0x03FE410928B8F950F
+ .quad 0x03FE41630C1B50AFF # 0.627708795866 894
+ .quad 0x03FE41630C1B50AFF
+ .quad 0x03FE419F01CD27AD0 # 0.628166252416 895
+ .quad 0x03FE419F01CD27AD0
+ .quad 0x03FE41DAFE85672B9 # 0.628623918328 896
+ .quad 0x03FE41DAFE85672B9
+ .quad 0x03FE42170245B4C6A # 0.629081793794 897
+ .quad 0x03FE42170245B4C6A
+ .quad 0x03FE42711518DF546 # 0.629769000326 898
+ .quad 0x03FE42711518DF546
+ .quad 0x03FE42AD2A74888A0 # 0.630227400518 899
+ .quad 0x03FE42AD2A74888A0
+ .quad 0x03FE42E946DE080C0 # 0.630686010936 900
+ .quad 0x03FE42E946DE080C0
+ .quad 0x03FE43437EB9D9424 # 0.631374321162 901
+ .quad 0x03FE43437EB9D9424
+ .quad 0x03FE437FACCD31C10 # 0.631833457993 902
+ .quad 0x03FE437FACCD31C10
+ .quad 0x03FE43BBE1F42FE09 # 0.632292805727 903
+ .quad 0x03FE43BBE1F42FE09
+ .quad 0x03FE43F81E307DE5E # 0.632752364559 904
+ .quad 0x03FE43F81E307DE5E
+ .quad 0x03FE445285D68EA69 # 0.633442099038 905
+ .quad 0x03FE445285D68EA69
+ .quad 0x03FE448ED3CF71355 # 0.633902186463 906
+ .quad 0x03FE448ED3CF71355
+ .quad 0x03FE44CB28E37C3EE # 0.634362485666 907
+ .quad 0x03FE44CB28E37C3EE
+ .quad 0x03FE450785145CAFE # 0.634822996841 908
+ .quad 0x03FE450785145CAFE
+ .quad 0x03FE45621CB769366 # 0.635514161481 909
+ .quad 0x03FE45621CB769366
+ .quad 0x03FE459E8AB7B799D # 0.635975203444 910
+ .quad 0x03FE459E8AB7B799D
+ .quad 0x03FE45DAFFDABD4DB # 0.636436458065 911
+ .quad 0x03FE45DAFFDABD4DB
+ .quad 0x03FE46177C2229EC0 # 0.636897925539 912
+ .quad 0x03FE46177C2229EC0
+ .quad 0x03FE467243F53F69E # 0.637590526283 913
+ .quad 0x03FE467243F53F69E
+ .quad 0x03FE46AED21F117FC # 0.638052526753 914
+ .quad 0x03FE46AED21F117FC
+ .quad 0x03FE46EB677335D13 # 0.638514740766 915
+ .quad 0x03FE46EB677335D13
+ .quad 0x03FE472803F35EAAE # 0.638977168520 916
+ .quad 0x03FE472803F35EAAE
+ .quad 0x03FE4764A7A13EF3B # 0.639439810212 917
+ .quad 0x03FE4764A7A13EF3B
+ .quad 0x03FE47BFAA9F80271 # 0.640134174319 918
+ .quad 0x03FE47BFAA9F80271
+ .quad 0x03FE47FC60471DAF8 # 0.640597351724 919
+ .quad 0x03FE47FC60471DAF8
+ .quad 0x03FE48391D226992D # 0.641060743762 920
+ .quad 0x03FE48391D226992D
+ .quad 0x03FE4875E1331971E # 0.641524350631 921
+ .quad 0x03FE4875E1331971E
+ .quad 0x03FE48D114D3FB884 # 0.642220164181 922
+ .quad 0x03FE48D114D3FB884
+ .quad 0x03FE490DEAF1A3FC8 # 0.642684309003 923
+ .quad 0x03FE490DEAF1A3FC8
+ .quad 0x03FE494AC84AB0ED3 # 0.643148669355 924
+ .quad 0x03FE494AC84AB0ED3
+ .quad 0x03FE4987ACE0DABB0 # 0.643613245438 925
+ .quad 0x03FE4987ACE0DABB0
+ .quad 0x03FE49C498B5DA63F # 0.644078037452 926
+ .quad 0x03FE49C498B5DA63F
+ .quad 0x03FE4A20080EF10B2 # 0.644775630783 927
+ .quad 0x03FE4A20080EF10B2
+ .quad 0x03FE4A5D060894B8C # 0.645240963504 928
+ .quad 0x03FE4A5D060894B8C
+ .quad 0x03FE4A9A0B471A943 # 0.645706512861 929
+ .quad 0x03FE4A9A0B471A943
+ .quad 0x03FE4AD717CC3E626 # 0.646172279055 930
+ .quad 0x03FE4AD717CC3E626
+ .quad 0x03FE4B142B99BC871 # 0.646638262288 931
+ .quad 0x03FE4B142B99BC871
+ .quad 0x03FE4B6FD6F970C1F # 0.647337644529 932
+ .quad 0x03FE4B6FD6F970C1F
+ .quad 0x03FE4BACFD036D080 # 0.647804171246 933
+ .quad 0x03FE4BACFD036D080
+ .quad 0x03FE4BEA2A5BDBE87 # 0.648270915712 934
+ .quad 0x03FE4BEA2A5BDBE87
+ .quad 0x03FE4C275F047C956 # 0.648737878130 935
+ .quad 0x03FE4C275F047C956
+ .quad 0x03FE4C649AFF0EE16 # 0.649205058703 936
+ .quad 0x03FE4C649AFF0EE16
+ .quad 0x03FE4CC082B46485A # 0.649906239052 937
+ .quad 0x03FE4CC082B46485A
+ .quad 0x03FE4CFDD1037E37C # 0.650373965908 938
+ .quad 0x03FE4CFDD1037E37C
+ .quad 0x03FE4D3B26AAADDD9 # 0.650841911635 939
+ .quad 0x03FE4D3B26AAADDD9
+ .quad 0x03FE4D7883ABB61F6 # 0.651310076438 940
+ .quad 0x03FE4D7883ABB61F6
+ .quad 0x03FE4DB5E8085A477 # 0.651778460521 941
+ .quad 0x03FE4DB5E8085A477
+ .quad 0x03FE4DF353C25E42B # 0.652247064091 942
+ .quad 0x03FE4DF353C25E42B
+ .quad 0x03FE4E4F832C560DD # 0.652950381434 943
+ .quad 0x03FE4E4F832C560DD
+ .quad 0x03FE4E8D015786F16 # 0.653419534621 944
+ .quad 0x03FE4E8D015786F16
+ .quad 0x03FE4ECA86E64A683 # 0.653888908016 945
+ .quad 0x03FE4ECA86E64A683
+ .quad 0x03FE4F0813DA673DD # 0.654358501826 946
+ .quad 0x03FE4F0813DA673DD
+ .quad 0x03FE4F45A835A4E19 # 0.654828316258 947
+ .quad 0x03FE4F45A835A4E19
+ .quad 0x03FE4F8343F9CB678 # 0.655298351519 948
+ .quad 0x03FE4F8343F9CB678
+ .quad 0x03FE4FDFBB88A119A # 0.656003818920 949
+ .quad 0x03FE4FDFBB88A119A
+ .quad 0x03FE501D69DADD660 # 0.656474407164 950
+ .quad 0x03FE501D69DADD660
+ .quad 0x03FE505B1F9C43ED7 # 0.656945216966 951
+ .quad 0x03FE505B1F9C43ED7
+ .quad 0x03FE5098DCCE9FABA # 0.657416248534 952
+ .quad 0x03FE5098DCCE9FABA
+ .quad 0x03FE50D6A173BC425 # 0.657887502077 953
+ .quad 0x03FE50D6A173BC425
+ .quad 0x03FE51146D8D65F98 # 0.658358977805 954
+ .quad 0x03FE51146D8D65F98
+ .quad 0x03FE5152411D69C03 # 0.658830675927 955
+ .quad 0x03FE5152411D69C03
+ .quad 0x03FE51AF0C774A2D0 # 0.659538640558 956
+ .quad 0x03FE51AF0C774A2D0
+ .quad 0x03FE51ECF2B713F8A # 0.660010895584 957
+ .quad 0x03FE51ECF2B713F8A
+ .quad 0x03FE522AE0738A3D8 # 0.660483373741 958
+ .quad 0x03FE522AE0738A3D8
+ .quad 0x03FE5268D5AE7CDCB # 0.660956075239 959
+ .quad 0x03FE5268D5AE7CDCB
+ .quad 0x03FE52A6D269BC600 # 0.661429000289 960
+ .quad 0x03FE52A6D269BC600
+ .quad 0x03FE52E4D6A719F9B # 0.661902149103 961
+ .quad 0x03FE52E4D6A719F9B
+ .quad 0x03FE5322E26867857 # 0.662375521893 962
+ .quad 0x03FE5322E26867857
+ .quad 0x03FE53800225BA6E2 # 0.663086001497 963
+ .quad 0x03FE53800225BA6E2
+ .quad 0x03FE53BE20B8DA502 # 0.663559935155 964
+ .quad 0x03FE53BE20B8DA502
+ .quad 0x03FE53FC46D64DDD1 # 0.664034093533 965
+ .quad 0x03FE53FC46D64DDD1
+ .quad 0x03FE543A747FE9ED6 # 0.664508476843 966
+ .quad 0x03FE543A747FE9ED6
+ .quad 0x03FE5478A9B78404C # 0.664983085300 967
+ .quad 0x03FE5478A9B78404C
+ .quad 0x03FE54B6E67EF251C # 0.665457919117 968
+ .quad 0x03FE54B6E67EF251C
+ .quad 0x03FE54F52AD80BAE9 # 0.665932978509 969
+ .quad 0x03FE54F52AD80BAE9
+ .quad 0x03FE553376C4A7A16 # 0.666408263689 970
+ .quad 0x03FE553376C4A7A16
+ .quad 0x03FE5571CA469E5C9 # 0.666883774872 971
+ .quad 0x03FE5571CA469E5C9
+ .quad 0x03FE55CF55C5A5437 # 0.667597465874 972
+ .quad 0x03FE55CF55C5A5437
+ .quad 0x03FE560DBC45153C7 # 0.668073543008 973
+ .quad 0x03FE560DBC45153C7
+ .quad 0x03FE564C2A6059FE7 # 0.668549846899 974
+ .quad 0x03FE564C2A6059FE7
+ .quad 0x03FE568AA0194EC6E # 0.669026377763 975
+ .quad 0x03FE568AA0194EC6E
+ .quad 0x03FE56C91D71CF810 # 0.669503135817 976
+ .quad 0x03FE56C91D71CF810
+ .quad 0x03FE5707A26BB8C66 # 0.669980121278 977
+ .quad 0x03FE5707A26BB8C66
+ .quad 0x03FE57462F08E7DF5 # 0.670457334363 978
+ .quad 0x03FE57462F08E7DF5
+ .quad 0x03FE5784C34B3AC30 # 0.670934775289 979
+ .quad 0x03FE5784C34B3AC30
+ .quad 0x03FE57C35F3490183 # 0.671412444273 980
+ .quad 0x03FE57C35F3490183
+ .quad 0x03FE580202C6C7353 # 0.671890341535 981
+ .quad 0x03FE580202C6C7353
+ .quad 0x03FE5840AE03C0204 # 0.672368467291 982
+ .quad 0x03FE5840AE03C0204
+ .quad 0x03FE589EBD437CA31 # 0.673086084831 983
+ .quad 0x03FE589EBD437CA31
+ .quad 0x03FE58DD7BB392B30 # 0.673564782782 984
+ .quad 0x03FE58DD7BB392B30
+ .quad 0x03FE591C41D500163 # 0.674043709994 985
+ .quad 0x03FE591C41D500163
+ .quad 0x03FE595B0FA9A7EF1 # 0.674522866688 986
+ .quad 0x03FE595B0FA9A7EF1
+ .quad 0x03FE5999E5336E121 # 0.675002253082 987
+ .quad 0x03FE5999E5336E121
+ .quad 0x03FE59D8C2743705E # 0.675481869398 988
+ .quad 0x03FE59D8C2743705E
+ .quad 0x03FE5A17A76DE803B # 0.675961715857 989
+ .quad 0x03FE5A17A76DE803B
+ .quad 0x03FE5A56942266F7B # 0.676441792678 990
+ .quad 0x03FE5A56942266F7B
+ .quad 0x03FE5A9588939A810 # 0.676922100084 991
+ .quad 0x03FE5A9588939A810
+ .quad 0x03FE5AD484C369F2D # 0.677402638296 992
+ .quad 0x03FE5AD484C369F2D
+ .quad 0x03FE5B1388B3BD53E # 0.677883407536 993
+ .quad 0x03FE5B1388B3BD53E
+ .quad 0x03FE5B5294667D5F7 # 0.678364408027 994
+ .quad 0x03FE5B5294667D5F7
+ .quad 0x03FE5B91A7DD93852 # 0.678845639990 995
+ .quad 0x03FE5B91A7DD93852
+ .quad 0x03FE5BD0C31AE9E9D # 0.679327103649 996
+ .quad 0x03FE5BD0C31AE9E9D
+ .quad 0x03FE5C2F7A8ED5E5B # 0.680049734055 997
+ .quad 0x03FE5C2F7A8ED5E5B
+ .quad 0x03FE5C6EA94431EF9 # 0.680531777930 998
+ .quad 0x03FE5C6EA94431EF9
+ .quad 0x03FE5CADDFC6874F5 # 0.681014054284 999
+ .quad 0x03FE5CADDFC6874F5
+ .quad 0x03FE5CED1E17C35C6 # 0.681496563340 1000
+ .quad 0x03FE5CED1E17C35C6
+ .quad 0x03FE5D2C6439D4252 # 0.681979305324 1001
+ .quad 0x03FE5D2C6439D4252
+ .quad 0x03FE5D6BB22EA86F6 # 0.682462280460 1002
+ .quad 0x03FE5D6BB22EA86F6
+ .quad 0x03FE5DAB07F82FB84 # 0.682945488974 1003
+ .quad 0x03FE5DAB07F82FB84
+ .quad 0x03FE5DEA65985A350 # 0.683428931091 1004
+ .quad 0x03FE5DEA65985A350
+ .quad 0x03FE5E29CB1118D32 # 0.683912607038 1005
+ .quad 0x03FE5E29CB1118D32
+ .quad 0x03FE5E6938645D390 # 0.684396517040 1006
+ .quad 0x03FE5E6938645D390
+ .quad 0x03FE5EA8AD9419C5B # 0.684880661324 1007
+ .quad 0x03FE5EA8AD9419C5B
+ .quad 0x03FE5EE82AA241920 # 0.685365040118 1008
+ .quad 0x03FE5EE82AA241920
+ .quad 0x03FE5F27AF90C8705 # 0.685849653648 1009
+ .quad 0x03FE5F27AF90C8705
+ .quad 0x03FE5F673C61A2ED2 # 0.686334502142 1010
+ .quad 0x03FE5F673C61A2ED2
+ .quad 0x03FE5FA6D116C64F7 # 0.686819585829 1011
+ .quad 0x03FE5FA6D116C64F7
+ .quad 0x03FE5FE66DB228992 # 0.687304904936 1012
+ .quad 0x03FE5FE66DB228992
+ .quad 0x03FE60261235C0874 # 0.687790459692 1013
+ .quad 0x03FE60261235C0874
+ .quad 0x03FE6065BEA385926 # 0.688276250325 1014
+ .quad 0x03FE6065BEA385926
+ .quad 0x03FE60A572FD6FEF1 # 0.688762277066 1015
+ .quad 0x03FE60A572FD6FEF1
+ .quad 0x03FE60E52F45788E4 # 0.689248540144 1016
+ .quad 0x03FE60E52F45788E4
+ .quad 0x03FE6124F37D991D4 # 0.689735039789 1017
+ .quad 0x03FE6124F37D991D4
+ .quad 0x03FE6164BFA7CC06C # 0.690221776231 1018
+ .quad 0x03FE6164BFA7CC06C
+ .quad 0x03FE61A493C60C729 # 0.690708749700 1019
+ .quad 0x03FE61A493C60C729
+ .quad 0x03FE61E46FDA56466 # 0.691195960429 1020
+ .quad 0x03FE61E46FDA56466
+ .quad 0x03FE622453E6A6263 # 0.691683408647 1021
+ .quad 0x03FE622453E6A6263
+ .quad 0x03FE62643FECF9743 # 0.692171094587 1022
+ .quad 0x03FE62643FECF9743
+ .quad 0x03FE62A433EF4E51A # 0.692659018480 1023
+ .quad 0x03FE62A433EF4E51A
+
+
+
diff --git a/src/gas/vrdasin.S b/src/gas/vrdasin.S
new file mode 100644
index 0000000..a5fb8d4
--- /dev/null
+++ b/src/gas/vrdasin.S
@@ -0,0 +1,3073 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdasin.s
+#
+# An array implementation of the sin libm function.
+#
+# Prototype:
+#
+# void vrda_sin(int n, double *x, double *y);
+#
+#Computes Sine of x for an array of input values.
+#Places the results into the supplied y array.
+#Does not perform error checking.
+#Denormal inputs may produce unexpected results
+#Author: Harsha Jagasia
+#Email: harsha.jagasia@amd.com
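+#
+#For orientation, a minimal C sketch of the contract (illustrative only; the
+#routine below works on two inputs per xmm register and routes large and
+#special arguments through separate paths):
+#
+#  /* assumes <math.h>; vrda_sin_reference is an illustrative name */
+#  void vrda_sin_reference(int n, double *x, double *y)
+#  {
+#      for (int i = 0; i < n; i++)
+#          y[i] = sin(x[i]);   /* no error checking, as noted above */
+#  }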
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
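+# The coefficient tables above (.Lcosarray/.Lsinarray and their interleaved
+# variants) feed the core polynomial approximations once the argument has
+# been reduced to |r| <= pi/4.  Roughly, in illustrative C (the code below
+# interleaves and schedules these steps differently):
+#
+#   double r2 = r * r;
+#   double sin_r = r + r*r2*(s1 + r2*(s2 + r2*(s3 + r2*(s4 + r2*(s5 + r2*s6)))));
+#   double cos_r = 1.0 - 0.5*r2
+#                  + r2*r2*(c1 + r2*(c2 + r2*(c3 + r2*(c4 + r2*(c5 + r2*c6)))));
+#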
+.Levensin_oddcos_tbl:
+ .quad .Lsinsin_sinsin_piby4 # 0
+ .quad .Lsinsin_sincos_piby4 # 1
+ .quad .Lsinsin_cossin_piby4 # 2
+ .quad .Lsinsin_coscos_piby4 # 3
+
+ .quad .Lsincos_sinsin_piby4 # 4
+ .quad .Lsincos_sincos_piby4 # 5
+ .quad .Lsincos_cossin_piby4 # 6
+ .quad .Lsincos_coscos_piby4 # 7
+
+ .quad .Lcossin_sinsin_piby4 # 8
+ .quad .Lcossin_sincos_piby4 # 9
+ .quad .Lcossin_cossin_piby4 # 10
+ .quad .Lcossin_coscos_piby4 # 11
+
+ .quad .Lcoscos_sinsin_piby4 # 12
+ .quad .Lcoscos_sincos_piby4 # 13
+ .quad .Lcoscos_cossin_piby4 # 14
+ .quad .Lcoscos_coscos_piby4 # 15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrda_sin_
+ .set vrda_sin_,__vrda_sin__
+ .weak vrda_sin__
+ .set vrda_sin__,__vrda_sin__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array sin
+#** VRDA_SIN(N,X,Y)
+# C equivalent*/
+#void vrda_sin__(int * n, double *x, double *y)
+#{
+# vrda_sin(*n,x,y);
+#}
+.globl __vrda_sin__
+ .type __vrda_sin__,@function
+__vrda_sin__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp, 0x00 # temporary for get/put bits operation
+.equ p_temp1, 0x10 # temporary for get/put bits operation
+
+.equ save_xmm6, 0x20 # save area for xmm6
+.equ save_xmm7, 0x30 # save area for xmm7
+.equ save_xmm8, 0x40 # save area for xmm8
+.equ save_xmm9, 0x50 # save area for xmm9
+.equ save_xmm10, 0x60 # save area for xmm10
+.equ save_xmm11, 0x70 # save area for xmm11
+.equ save_xmm12, 0x80 # save area for xmm12
+.equ save_xmm13, 0x90 # save area for xmm13
+.equ save_xmm14, 0x0A0 # save area for xmm14
+.equ save_xmm15, 0x0B0 # save area for xmm15
+
+.equ r, 0x0C0 # storage for r passed to remainder_piby2
+.equ rr, 0x0D0 # storage for rr passed to remainder_piby2
+.equ region, 0x0E0 # storage for region passed to remainder_piby2
+
+.equ r1, 0x0F0 # storage for r (second pair)
+.equ rr1, 0x0100 # storage for rr (second pair)
+.equ region1, 0x0110 # storage for region (second pair)
+
+.equ p_temp2, 0x0120 # temporary for get/put bits operation
+.equ p_temp3, 0x0130 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0140 # temporary for get/put bits operation
+.equ p_temp5, 0x0150 # temporary for get/put bits operation
+
+.equ p_original, 0x0160 # original x
+.equ p_mask, 0x0170 # mask
+.equ p_sign, 0x0180 # sign
+
+.equ p_original1, 0x0190 # original x (second pair)
+.equ p_mask1, 0x01A0 # mask (second pair)
+.equ p_sign1, 0x01B0 # sign (second pair)
+
+.equ save_r12, 0x01C0 # save area for r12
+.equ save_r13, 0x01D0 # save area for r13
+
+.equ save_xa, 0x01E0 #qword
+.equ save_ya, 0x01F0 #qword
+
+.equ save_nv, 0x0200 #qword
+.equ p_iter, 0x0210 # qword storage for number of loop iterations
+
+
+.globl vrda_sin
+ .type vrda_sin,@function
+vrda_sin:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in on Linux (System V AMD64 ABI) as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ sub $0x228,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+
+# save the arguments
+ mov %rsi, save_xa(%rsp) # save x_array pointer
+ mov %rdx, save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrda_cleanup # jump if only single calls
+
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrda_top:
+
+# build the input _m128d
+ movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ mov (%rsi),%rax
+ mov 8(%rsi),%rcx
+ movdqa %xmm0,%xmm6
+
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi), %xmm1
+ movhpd -8(%rsi), %xmm1
+ mov -16(%rsi), %r8
+ mov -8(%rsi), %r9
+ movdqa %xmm1,%xmm7
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+and .L__real_7fffffffffffffff(%rip), %rax
+and .L__real_7fffffffffffffff(%rip), %rcx
+and .L__real_7fffffffffffffff(%rip), %r8
+and .L__real_7fffffffffffffff(%rip), %r9
+
+movdqa %xmm0,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
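+#
+# In scalar terms the reduction carried out above (and completed just below
+# when r and rr are formed) is, illustratively:
+#
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;
+#   rtail = npi2 * piby2_2;
+#   t     = rhead;   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   r     = rhead - rtail;          /* reduced argument          */
+#   rr    = (rhead - r) - rtail;    /* low-order correction to r */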
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
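+#
+# Per lane the final sign is sign(x) XOR bit 1 of npi2: with x = npi2*(pi/2)+r
+# the result is +/-sin(r) or +/-cos(r), the minus sign appearing exactly when
+# bit 1 of npi2 is set, and sin(-x) = -sin(x) folds the input sign back in.
+# Illustrative scalar C for one lane (quad/negate are illustrative names):
+#
+#   int quad   = npi2 & 3;                      /* selects sin vs cos of r */
+#   int negate = ((npi2 >> 1) ^ (x < 0.0)) & 1;
+#   /* the masks written to p_sign/p_sign1 above carry this bit in the
+#      floating-point sign position so it can be applied at the end */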
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm0,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+#DEBUG
+# jmp .Lfinal_check
+#DEBUG
+
+ leaq .Levensin_oddcos_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
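+#
+# The 4-bit dispatch index packs the low quadrant bit of each of the four
+# lanes (0 => compute sin(r), 1 => compute cos(r) for that lane).  Roughly,
+# in illustrative C, with quad0..quad3 the per-lane npi2 values:
+#
+#   idx =  (quad0 & 1)         /* xmm0 low lane  */
+#       | ((quad1 & 1) << 1)   /* xmm0 high lane */
+#       | ((quad2 & 1) << 2)   /* xmm1 low lane  */
+#       | ((quad3 & 1) << 3);  /* xmm1 high lane */
+#   /* then jump through .Levensin_oddcos_tbl[idx] */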
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm8,,%xmm5 xmm10, xmm12
+# %xmm11,,%xmm9 xmm13
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
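+# From the register setup above, the helper appears to take the argument in
+# xmm0 and pointers to the r, rr and 32-bit region slots in rdi/rsi/rdx,
+# i.e. roughly:
+#   void __amd_remainder_piby2(double x, double *r, double *rr, int *region);
+# (an inference from this call site)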
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm0 xmm6 = x, xmm4 = 0.5
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+#	mov	p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5:
+# mov p_original+8(%rsp),%rcx ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rax, %rcx, %r8, %r9
+# %xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+# Do not use %xmm1, %xmm3, %xmm7
+# Restore %xmm4 and %xmm1, %xmm3, %xmm7
+# Can use %xmm8, %xmm10, %xmm12
+#         %xmm5, %xmm9, %xmm11, %xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+#	movlhps	%xmm0,%xmm0	;Not needed since we want to work on the lower arg, but done just to be safe, to avoid exceptions due to nan/inf, and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm10	# xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10				# xmm10 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	 %eax,region(%rsp)			# store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+	movlpd	 %xmm0,r(%rsp)				# store lower r
+	movlpd	 %xmm6,rr(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf:
+# mov p_original+8(%rsp),%rcx ; upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+	mov	$0x411E848000000000,%r10		# 5e5 in double-precision bits
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm1, %xmm3, %xmm5 = x, %xmm4 = 0.5
+
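+# (Both values of this pair are below the 5e5 cutoff, so neither can be
+#  NaN/Inf and the reduction below can use packed (pd) instructions on both
+#  lanes at once, unlike the scalar paths above.)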
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rax, %rcx, %r8, %r9
+# %xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+# Do not use %xmm1, %xmm3, %xmm7
+# Can use %xmm9, %xmm11, %xmm13
+#         %xmm5, %xmm8, %xmm10, %xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rax, %rcx, %r8, %r9
+# %xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+
+	mov	$0x411E848000000000,%r10		# 5e5 in double-precision bits
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r8, %r9
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_higher:
+# mov p_original1(%rsp),%r8 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r8, %r9
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+# mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher:
+# mov p_original1+8(%rsp),%r9 ;upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rax, %rcx, %r8, %r9
+#%xmm0, %xmm2, %xmm6 = x, %xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r8, %r9
+# %xmm1, %xmm3, %xmm7 = x, %xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+#	movlhps	%xmm1,%xmm1	;Not needed since we want to work on the lower arg, but done just to be safe, to avoid exceptions due to nan/inf, and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+	subsd	%xmm0,%xmm1				# xmm1 = r = (rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+	subsd	%xmm0,%xmm7				# xmm7 = rr = ((rhead-r) - rtail)
+
+	movlpd	 %xmm1,r1(%rsp)				# store lower r
+	movlpd	 %xmm7,rr1(%rsp)			# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_higher:
+# mov p_original1+8(%rsp),%r9 ; upper arg is nan/inf
+# mov r1+8(%rsp),%r9 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
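+# (Commentary: the block above appears to form, for each lane, the final sign
+#  as A XOR B -- the ~AB + A~B expression is just an XOR -- where A comes from
+#  r12/r13 (presumably the saved input sign bits) and B is bit 1 of the lane's
+#  region.  Each resulting bit is shifted into bit 63 of its lane of
+#  p_sign/p_sign1 so the cleanup code can apply it with a single xorpd.)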
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm0,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levensin_oddcos_tbl(%rip),%rsi
+ jmp *(%rsi,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
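+# (Commentary: the index in %rax appears to pack the low (even/odd) bit of all
+#  four regions into a 4-bit value -- bits 0-1 from the first pair, bits 2-3
+#  from the second -- selecting one of the sin/cos combination routines below
+#  via .Levensin_oddcos_tbl.)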
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_cleanup:
+
+ movapd p_sign(%rsp), %xmm0
+ movapd p_sign1(%rsp), %xmm1
+ xorpd %xmm4, %xmm0 # (+) Sign
+ xorpd %xmm5, %xmm1 # (+) Sign
+
+.L__vrda_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlpd %xmm0,(%rdi)
+ movhpd %xmm0,8(%rdi)
+
+.L__vrda_bottom2:
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movlpd %xmm1, -16(%rdi)
+ movhpd %xmm1, -8(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrda_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrda_cleanup
+
+.L__final_check:
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x228,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when we have 1-3 leftover sin values to compute at the end
+# The number of values left is in save_nv
+
+.align 16
+.L__vrda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill a four-double buffer with zeroes and the extra values, then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_temp+8(%rsp)
+ movapd %xmm0,p_temp+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_temp+8(%rsp)
+ cmp $3,%rax
+ jl .L__vrdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_temp+16(%rsp)
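+
+# (The 1-3 leftover inputs are padded with zeroes to a full group of four and
+#  handled by the recursive call below; only the valid results are copied back
+#  to the caller's array afterwards.)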
+
+.L__vrdacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &y parameter
+ call vrda_sin@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p_temp2(%rsp),%rcx
+ mov %rcx, (%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+8(%rsp),%rcx
+ mov %rcx, 8(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+16(%rsp),%rcx
+ mov %rcx, 16(%rdi) # do the third value
+
+.L__vrdacgf:
+ jmp .L__final_check
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+
+
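+# (Both pairs land in a "cos" region here.  The code below appears to follow
+#  the usual core evaluation: with r = 0.5*x^2 and t = 1 - r,
+#	cos(x + xx) ~ t + (((1 - t) - r) - x*xx) + x^4*zc,
+#  where zc is the even polynomial in x^2 built from .Lcosarray.)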
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12	# t recalculated, -t = r-1
+	subpd	.L__real_3ff0000000000000(%rip),%xmm13	# t recalculated, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
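+# (Mixed case: for each input pair the low lane needs sin and the high lane
+#  needs cos.  .Lsincosarray appears to hold sin coefficients in its low
+#  halves and cos coefficients in its high halves, and the two lanes are
+#  split apart with movhlps once the shared polynomial work is done.)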
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ mulsd p_temp(%rsp),%xmm2 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 #1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 #1+(-t)
+ addsd p_temp(%rsp),%xmm4 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+ addsd %xmm0,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm1,p_temp3(%rsp) # Store r for the sincos term
+
+ movapd .Lsincosarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lsincosarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm10,%xmm10 # move high x4 for cos term
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ movhlps %xmm3,%xmm7 # move high x2 for x3 for sin term (sincos)
+
+ mulsd %xmm0,%xmm6 # get low x3 for sin term
+ mulsd p_temp3+8(%rsp),%xmm7 # get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term (sincos)
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm10,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ mulsd p_temp(%rsp),%xmm2 # low 0.5 * x2 * xx for sin term (cossin)
+ mulsd p_temp1+8(%rsp),%xmm13 # high 0.5 * x2 * xx for sin term (sincos)
+
+ movsd %xmm12,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep low r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ subsd %xmm2,%xmm4 # sin - 0.5 * x2 *xx (cossin)
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx (sincos)
+
+ movhlps %xmm0,%xmm10 # move high x for x*xx for cos term (cossin)
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ mulsd p_temp+8(%rsp),%xmm10 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm12,%xmm2 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm12 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+
+ addsd p_temp(%rsp),%xmm4 # sin+xx +
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx +
+
+ subsd %xmm6,%xmm12 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm10,%xmm12 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm0,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ addsd %xmm12,%xmm8 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm2,%xmm8 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+ movapd %xmm1,p_temp3(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lcossinarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lcossinarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+	movhlps	%xmm2,%xmm6				# move high x2 for x3 for sin term
+	movhlps	%xmm3,%xmm7				# move high x2 for x3 for sin term
+	mulsd	p_temp2+8(%rsp),%xmm6			# get high x3 for sin term
+	mulsd	p_temp3+8(%rsp),%xmm7			# get high x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high 0.5*x2 for sin term
+ # Reverse 12 and 2
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm7,%xmm9 # sin *x3
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1+8(%rsp),%xmm13 # 0.5 * x2 * xx for sin term
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm3,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm13,%xmm9 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1(%rsp),%xmm1 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm3,%xmm13 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm3 # 1+(-t)
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1+8(%rsp),%xmm9 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm3 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm1,%xmm3 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm11,%xmm9 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm3,%xmm5 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm13,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_sincos_piby4: # changed from sincos_sincos
+ # xmm1 is cossin and xmm0 is sincos
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ movapd %xmm6,p_temp(%rsp) # Store rr
+ movapd %xmm7,p_temp1(%rsp) # Store rr
+ movapd %xmm0,p_temp2(%rsp) # Store r
+
+
+ movapd .Lcossinarray+0x50(%rip),%xmm4 # s6
+ movapd .Lsincosarray+0x50(%rip),%xmm5 # s6
+ movdqa .Lcossinarray+0x20(%rip),%xmm8 # s3
+ movdqa .Lsincosarray+0x20(%rip),%xmm9 # s3
+
+ movapd %xmm2,%xmm10 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # s2+x2s3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # s2+x2s3
+
+ movapd %xmm2,%xmm12 # move x2 for x6
+ movapd %xmm3,%xmm13 # move x2 for x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2s3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2s3)
+
+ mulpd %xmm10,%xmm12 # x6
+ mulpd %xmm11,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # s4+x2(s5+x2s6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # s4+x2(s5+x2s6)
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2(s2+x2s3)
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2(s2+x2s3)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term +
+
+ mulpd %xmm12,%xmm4 # x6(s4+x2(s5+x2s6))
+ mulpd %xmm13,%xmm5 # x6(s4+x2(s5+x2s6))
+
+	movhlps	%xmm2,%xmm6				# move high x2 for x3 for sin term
+	movsd	%xmm3,%xmm7				# move low x2 for x3 for sin term +
+	mulsd	p_temp2+8(%rsp),%xmm6			# get high x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term +
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high 0.5*x2 for sin term
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin
+
+ mulsd %xmm6,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+ mulsd %xmm10,%xmm4 # cos *x4
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd p_temp+8(%rsp),%xmm12 # 0.5 * x2 * xx for sin term
+ mulsd p_temp1(%rsp),%xmm3 # 0.5 * x2 * xx for sin term
+
+ movsd %xmm2,%xmm6 # Keep high r for cos term
+ movsd %xmm13,%xmm7 # Keep high r for cos term
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ subsd %xmm12,%xmm8 # sin - 0.5 * x2 *xx
+ subsd %xmm3,%xmm5 # sin - 0.5 * x2 *xx
+
+ movhlps %xmm0,%xmm10 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x*xx for cos term
+
+ mulsd p_temp(%rsp),%xmm0 # x * xx
+ mulsd p_temp1+8(%rsp),%xmm11 # x * xx
+
+ movsd %xmm2,%xmm12 # move -t for cos term
+ movsd %xmm13,%xmm3 # move -t for cos term
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm2 # 1+(-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm13 # 1+(-t)
+
+ addsd p_temp+8(%rsp),%xmm8 # sin+xx
+ addsd p_temp1(%rsp),%xmm5 # sin+xx
+
+ subsd %xmm6,%xmm2 # (1-t) - r
+ subsd %xmm7,%xmm13 # (1-t) - r
+
+ subsd %xmm0,%xmm2 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm11,%xmm13 # ((1 + (-t)) - r) - x*xx
+
+
+ addsd %xmm10,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ addsd %xmm2,%xmm4 # cos+((1-t)-r - x*xx)
+ addsd %xmm13,%xmm9 # cos+((1-t)-r - x*xx)
+
+ subsd %xmm12,%xmm4 # cos+t
+ subsd %xmm3,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+
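+# (Here the first pair takes the sin path (zs from .Lsinarray) and the second
+#  pair the cos path (zc from .Lcosarray).  The evaluation below appears to be
+#	sin(x + xx) ~ x + x^3*zs - 0.5*x^2*xx + xx
+#	cos(x + xx) ~ t + (((1 - t) - r) - x*xx) + x^4*zc,  r = 0.5*x^2, t = 1 - r.)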
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # store x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm11,p_temp3(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for 0.5*x2
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm12 # 0.5 *x2
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm0,%xmm2 # x3 recalculate
+ mulpd %xmm3,%xmm3 # x4 recalculate
+
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm6,%xmm12 # 0.5 * x2 *xx
+ mulpd %xmm1,%xmm7 # x * xx
+
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm12,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm6,%xmm4 # x3 * zs +xx
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm13	# t recalculated, -t = r-1
+ addpd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ movapd %xmm3,p_temp3(%rsp) # store x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ movapd %xmm10,p_temp2(%rsp) # store r
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for 0.5*x2
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd .L__real_3fe0000000000000(%rip),%xmm13 # 0.5 *x2
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm2 # x4 recalculate
+ mulpd %xmm1,%xmm3 # x3 recalculate
+
+ movapd p_temp2(%rsp),%xmm12 # r
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm7,%xmm13 # 0.5 * x2 *xx
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx;;;;;;;;;;;;;;;;;;;;;
+ subpd %xmm13,%xmm5 # -0.5 * x2 *xx
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm7,%xmm5 # +xx
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12	# t recalculated, -t = r-1
+ addpd %xmm1,%xmm5 # +x
+ subpd %xmm12,%xmm4 # + t
+
+ jmp .L__vrd4_sin_cleanup
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+ movhlps %xmm10,%xmm10 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ movsd %xmm0,%xmm8 # lower x for sin
+ mulsd %xmm2,%xmm8 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm8,%xmm2 # lower x3 for sin
+
+ movsd %xmm6,%xmm9 # lower xx
+ # note using odd reg
+
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+ mulpd %xmm0,%xmm6 # x * xx for upper cos term
+ mulpd %xmm1,%xmm7 # x * xx
+ movhlps %xmm6,%xmm6
+ mulsd p_temp2(%rsp),%xmm9 # xx * 0.5*x2 for sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+
+ subsd %xmm9,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm8 # + t
+ addsd %xmm0,%xmm4 # +x
+ subpd %xmm13,%xmm5 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+ addpd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t)
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zszc
+ addpd %xmm9,%xmm5 # z
+
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+ mulpd %xmm3,%xmm3 # x4
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using odd reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+ movapd p_temp3(%rsp),%xmm13 # r
+
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+ mulpd %xmm1,%xmm7 # x * xx
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ subpd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+ mulpd %xmm3,%xmm5
+ # x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= sin, xmm4= cos
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subpd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subpd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subsd %xmm12,%xmm4 # + t
+ subpd %xmm13,%xmm5 # + t
+ addsd %xmm0,%xmm8 # +x
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+ movhlps %xmm11,%xmm11 # get upper r for t for cos
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) ;trash t
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) ;trash t
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zcs
+
+ movsd %xmm1,%xmm9 # lower x for sin
+ mulsd %xmm3,%xmm9 # lower x3 for sin
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # upper x4 for cos
+ movsd %xmm9,%xmm3 # lower x3 for sin
+
+ movsd %xmm7,%xmm8 # lower xx
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for upper cos term
+ movhlps %xmm7,%xmm7
+ mulsd p_temp3(%rsp),%xmm8 # xx * 0.5*x2 for sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+ # x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm8,%xmm5 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+
+	subpd	.L__real_3ff0000000000000(%rip),%xmm12	# t recalculated, -t = r-1
+	subsd	.L__real_3ff0000000000000(%rip),%xmm13	# t recalculated, -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm9 # + t
+ addsd %xmm1,%xmm5 # +x
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsincosarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsincosarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ movhlps %xmm11,%xmm11
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsincosarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsincosarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsincosarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zczs
+
+ movsd %xmm3,%xmm12
+ mulsd %xmm1,%xmm12 # low x3 for sin
+
+ mulpd %xmm0, %xmm2 # x3
+ mulpd %xmm3, %xmm3 # high x4 for cos
+ movsd %xmm12,%xmm3 # low x3 for sin
+
+ movhlps %xmm1,%xmm8 # upper x for cos term
+ # note using even reg
+ movlpd p_temp3+8(%rsp),%xmm13 # upper r for cos term
+
+ mulsd p_temp1+8(%rsp),%xmm8 # x * xx for upper cos term
+
+ mulsd p_temp3(%rsp),%xmm7 # xx * 0.5*x2 for lower sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+	mulpd	%xmm3,%xmm5	# lower=x3 * zs
+				# upper=x4 * zc
+
+ movhlps %xmm5,%xmm9 # xmm9= cos, xmm5= sin
+
+ subsd %xmm7,%xmm5 # x3zs - 0.5*x2*xx
+
+ subsd %xmm8,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm9 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1(%rsp),%xmm5 # +xx
+
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+
+ addsd %xmm1,%xmm5 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm9 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t)
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movapd p_temp2(%rsp),%xmm12 # r
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subpd %xmm12,%xmm10 # (1 + (-t)) - r
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subpd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ addpd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ subpd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ subpd %xmm12,%xmm4 # + t
+ subsd %xmm13,%xmm5 # + t
+ addsd %xmm1, %xmm9 # +x
+
+ movlhps %xmm9, %xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+ movapd %xmm2,%xmm10 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lcossinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lcossinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm11,p_temp3(%rsp) # r
+ movapd %xmm7,p_temp1(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lcossinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lcossinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm6,%xmm10 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm11 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lcossinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm0,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm7,%xmm8 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp3(%rsp),%xmm13 # lower r for cos term
+
+ mulpd %xmm1,%xmm7 # x * xx for lower cos term
+
+ mulsd p_temp3+8(%rsp),%xmm8 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm13,%xmm11 # (1 + (-t)) - r
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ movhlps %xmm5,%xmm9 # xmm9= sin, xmm5= cos
+
+ subsd %xmm8,%xmm9 # x3zs - 0.5*x2*xx
+
+ subsd %xmm7,%xmm11 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm10,%xmm4 # x3*zs - 0.5*x2*xx
+ addsd %xmm11,%xmm5 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp1+8(%rsp),%xmm9 # +xx
+
+ movhlps %xmm1,%xmm1 # upper x for sin
+ addpd %xmm6,%xmm4 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 # -t = r-1
+
+ addsd %xmm1,%xmm9 # +x
+ addpd %xmm0,%xmm4 # +x
+ subsd %xmm13,%xmm5 # + t
+
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_sinsin
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsincosarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsincosarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # x2
+ movapd %xmm6,p_temp(%rsp) # xx
+
+ movhlps %xmm10,%xmm10
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lsincosarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsincosarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5*x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lsincosarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsincosarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+
+ movsd %xmm2,%xmm13
+ mulsd %xmm0,%xmm13 # low x3 for sin
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm2,%xmm2 # high x4 for cos
+ movsd %xmm13,%xmm2 # low x3 for sin
+
+
+ movhlps %xmm0,%xmm9 # upper x for cos term ; note using even reg
+ movlpd p_temp2+8(%rsp),%xmm12 # upper r for cos term
+ mulsd p_temp+8(%rsp),%xmm9 # x * xx for upper cos term
+ mulsd p_temp2(%rsp),%xmm6 # xx * 0.5*x2 for lower sin term
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+ mulpd %xmm3,%xmm5 # x3 * zs
+	mulpd	%xmm2,%xmm4	# lower=x3 * zs
+				# upper=x4 * zc
+
+ movhlps %xmm4,%xmm8 # xmm8= cos, xmm4= sin
+ subsd %xmm6,%xmm4 # x3zs - 0.5*x2*xx
+
+ subsd %xmm9,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+
+ addsd %xmm10,%xmm8 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp(%rsp),%xmm4 # +xx
+
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+ addsd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm8 # + t
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lcossinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lcossinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm10 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+
+ movapd %xmm10,p_temp2(%rsp) # r
+ movapd %xmm6,p_temp(%rsp) # rr
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0 for cos
+
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+
+ addpd .Lcossinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lcossinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm7,%xmm11 # 0.5x2*xx
+ addsd .L__real_3ff0000000000000(%rip),%xmm10 # 1 + (-t) for cos
+
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x6
+ mulpd %xmm3,%xmm13 # x6
+
+ addpd .Lcossinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lcossinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm12,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm13,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm0,%xmm2 # upper x3 for sin
+ mulsd %xmm0,%xmm2 # lower x4 for cos
+
+ movhlps %xmm6,%xmm9 # upper xx for sin term
+ # note using even reg
+
+ movlpd p_temp2(%rsp),%xmm12 # lower r for cos term
+
+ mulpd %xmm0,%xmm6 # x * xx for lower cos term
+
+ mulsd p_temp2+8(%rsp),%xmm9 # xx * 0.5*x2 for upper sin term
+
+ subsd %xmm12,%xmm10 # (1 + (-t)) - r
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+	movhlps	%xmm4,%xmm8		# xmm8= sin, xmm4= cos
+
+ subsd %xmm9,%xmm8 # x3zs - 0.5*x2*xx
+
+ subsd %xmm6,%xmm10 # ((1 + (-t)) - r) - x*xx
+
+ subpd %xmm11,%xmm5 # x3*zs - 0.5*x2*xx
+ addsd %xmm10,%xmm4 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addsd p_temp+8(%rsp),%xmm8 # +xx
+
+ movhlps %xmm0,%xmm0 # upper x for sin
+ addpd %xmm7,%xmm5 # +xx
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t = r-1
+
+
+ addsd %xmm0,%xmm8 # +x
+ addpd %xmm1,%xmm5 # +x
+ subsd %xmm12,%xmm4 # + t
+
+ movlhps %xmm8,%xmm4
+
+ jmp .L__vrd4_sin_cleanup
+
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#DEBUG
+# xorpd %xmm0, %xmm0
+# xorpd %xmm1, %xmm1
+# jmp .Lfinal_check
+#DEBUG
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # c6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # c6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # c3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # c6*x2
+ mulpd %xmm3,%xmm5 # c6*x2
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # c5+x2c6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # c5+x2c6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # c2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(c5+x2c6)
+ mulpd %xmm3,%xmm5 # x2(c5+x2c6)
+ mulpd %xmm2,%xmm8 # x2(c2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(c2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # c4 + x2(c5+x2c6)
+ addpd .Lsinarray(%rip),%xmm8 # c1 + x2(c2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ mulpd %xmm10,%xmm4 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm5 # x6(c4 + x2(c5+x2c6))
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ movapd p_temp2(%rsp),%xmm10 # x2
+ movapd p_temp3(%rsp),%xmm11 # x2
+
+ mulpd %xmm0,%xmm10 # x3
+ mulpd %xmm1,%xmm11 # x3
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ subpd %xmm2,%xmm4 # -0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # -0.5 * x2 *xx
+
+ addpd %xmm6,%xmm4 # +xx
+ addpd %xmm7,%xmm5 # +xx
+
+ addpd %xmm0,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrd4_sin_cleanup
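+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# Illustrative scalar C sketch of the sin kernel evaluated above for each
+# lane (x = reduced argument r, xx = its tail rr, both in [-pi/4, pi/4]).
+# The coefficient values are the approximations commented in .Lsinarray;
+# the helper name is only for exposition:
+#
+#   static double sin_piby4_sketch(double x, double xx)
+#   {
+#       const double s1 = -0.166667,      s2 = 0.00833333,
+#                    s3 = -0.000198413,   s4 = 2.75573e-006,
+#                    s5 = -2.50511e-008,  s6 = 1.59181e-010;
+#       double x2 = x*x, x3 = x2*x, x6 = x3*x3;
+#       double zs = (s1 + x2*(s2 + x2*s3)) + x6*(s4 + x2*(s5 + x2*s6));
+#       return ((x3*zs - 0.5*x2*xx) + xx) + x;
+#   }
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;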
diff --git a/src/gas/vrdasincos.S b/src/gas/vrdasincos.S
new file mode 100644
index 0000000..d31e98a
--- /dev/null
+++ b/src/gas/vrdasincos.S
@@ -0,0 +1,1710 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrdasincos.s
+#
+# An array implementation of the sincos libm function.
+#
+# Prototype:
+#
+# void vrda_sincos(int n, double *x, double *ys, double *yc);
+#
+#Computes Sine of x for an array of input values.
+#Places the results into the supplied ys array.
+#Computes Cosine of x for an array of input values.
+#Places the results into the supplied yc array.
+#Does not perform error checking.
+#Denormal inputs may produce unexpected results
+#Author: Harsha Jagasia
+#Email: harsha.jagasia@amd.com
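+#
+# A minimal C usage sketch of the prototype above (array values are
+# illustrative only):
+#
+#   #include <stdio.h>
+#   extern void vrda_sincos(int n, double *x, double *ys, double *yc);
+#
+#   int main(void)
+#   {
+#       double x[4] = {0.1, 0.5, 1.0, 2.0}, s[4], c[4];
+#       vrda_sincos(4, x, s, c);     /* s[i] = sin(x[i]), c[i] = cos(x[i]) */
+#       for (int i = 0; i < 4; i++)
+#           printf("%g %g %g\n", x[i], s[i], c[i]);
+#       return 0;
+#   }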
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 16
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__real_jt_mask: .quad 0x0000000000000000F #
+ .quad 0x00000000000000000 #
+.L__real_naninf_upper_sign_mask: .quad 0x000000000ffffffff #
+ .quad 0x000000000ffffffff #
+.L__real_naninf_lower_sign_mask: .quad 0x0ffffffff00000000 #
+ .quad 0x0ffffffff00000000 #
+
+.Lcosarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03fa5555555555555
+ .quad 0x0bf56c16c16c16967 # -0.00138889 c2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03efa01a019f4ec90 # 2.48016e-005 c3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0be927e4fa17f65f6 # -2.75573e-007 c4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03e21eeb69037ab78 # 2.08761e-009 c5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0bda907db46cc5e42 # -1.13826e-011 c6
+ .quad 0x0bda907db46cc5e42
+.Lsinarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bfc5555555555555
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03f81111111110bb3
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0bf2a01a019e83e5c
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03ec71de3796cde01
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0be5ae600b42fdfa7
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x03de5e0b2f9a43bb8
+.Lsincosarray:
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x0bf56c16c16c16967
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x03efa01a019f4ec90
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x03e21eeb69037ab78
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+ .quad 0x0bda907db46cc5e42
+
+
+.Lcossinarray:
+ .quad 0x03fa5555555555555 # 0.0416667 c1
+ .quad 0x0bfc5555555555555 # -0.166667 s1
+ .quad 0x0bf56c16c16c16967
+ .quad 0x03f81111111110bb3 # 0.00833333 s2
+ .quad 0x03efa01a019f4ec90
+ .quad 0x0bf2a01a019e83e5c # -0.000198413 s3
+ .quad 0x0be927e4fa17f65f6
+ .quad 0x03ec71de3796cde01 # 2.75573e-006 s4
+ .quad 0x03e21eeb69037ab78
+ .quad 0x0be5ae600b42fdfa7 # -2.50511e-008 s5
+ .quad 0x0bda907db46cc5e42
+ .quad 0x03de5e0b2f9a43bb8 # 1.59181e-010 s6
+
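+
+# The two mixed tables interleave the series: .Lsincosarray holds a sin
+# coefficient in the low qword and the matching cos coefficient in the high
+# qword, .Lcossinarray the reverse, so a single packed multiply/add advances
+# both series when one lane needs sin and the other needs cos.  Illustrative
+# C view of one table entry (the type name is only for exposition):
+#
+#   typedef struct { double lo, hi; } qword_pair;
+#   /* .Lsincosarray[k] = { s[k+1], c[k+1] }   sin in lane 0, cos in lane 1 */
+#   /* .Lcossinarray[k] = { c[k+1], s[k+1] }   cos in lane 0, sin in lane 1 */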
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrda_sincos_
+ .set vrda_sincos_,__vrda_sincos__
+ .weak vrda_sincos__
+ .set vrda_sincos__,__vrda_sincos__
+
+.text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array sincos
+#** VRDA_SINCOS(N,X,YS,YC)
+# C equivalent*/
+#void vrda_sincos__( int * n, double *x, double *ys, double *yc)
+#{
+#	vrda_sincos(*n,x,ys,yc);
+#}
+.globl __vrda_sincos__
+ .type __vrda_sincos__,@function
+__vrda_sincos__:
+ mov (%rdi),%edi
+.align 16
+.p2align 4,,15
+
+# define local variable storage offsets
+.equ save_xmm6, 0x00 # temporary for get/put bits operation
+.equ save_xmm7, 0x10 # temporary for get/put bits operation
+.equ save_xmm8, 0x20 # temporary for get/put bits operation
+.equ save_xmm9, 0x30 # temporary for get/put bits operation
+.equ save_xmm10, 0x40 # temporary for get/put bits operation
+.equ save_xmm11, 0x50 # temporary for get/put bits operation
+.equ save_xmm12, 0x60 # temporary for get/put bits operation
+.equ save_xmm13, 0x70 # temporary for get/put bits operation
+.equ save_xmm14, 0x80 # temporary for get/put bits operation
+.equ save_xmm15, 0x90 # temporary for get/put bits operation
+
+.equ save_rdi, 0x0A0
+.equ save_rsi, 0x0B0
+.equ save_rbx, 0x0C0
+
+.equ r, 0x0D0 # pointer to r for remainder_piby2
+.equ	rr, 0x0E0	# pointer to rr for remainder_piby2
+.equ rsq, 0x0F0
+.equ	region, 0x0100	# pointer to region for remainder_piby2
+
+.equ r1, 0x0110 # pointer to r for remainder_piby2
+.equ	rr1, 0x0120	# pointer to rr1 for remainder_piby2
+.equ rsq1, 0x0130
+.equ	region1, 0x0140	# pointer to region1 for remainder_piby2
+
+.equ p_temp, 0x0150 # temporary for get/put bits operation
+.equ p_temp1, 0x0160 # temporary for get/put bits operation
+
+.equ p_temp2, 0x0170 # temporary for get/put bits operation
+.equ p_temp3, 0x0180 # temporary for get/put bits operation
+
+.equ p_temp4, 0x0190 # temporary for get/put bits operation
+.equ p_temp5, 0x01A0 # temporary for get/put bits operation
+
+.equ p_temp6, 0x01B0 # temporary for get/put bits operation
+.equ p_temp7, 0x01C0 # temporary for get/put bits operation
+
+.equ p_original, 0x01D0 # original x
+.equ p_mask, 0x01E0 # original x
+.equ	p_signs, 0x01F0		# sign words for sin
+.equ	p_signc, 0x0200		# sign words for cos
+.equ p_region, 0x0210
+
+.equ p_original1, 0x0220 # original x
+.equ p_mask1, 0x0230 # original x
+.equ	p_signs1, 0x0240	# sign words for sin
+.equ	p_signc1, 0x0250	# sign words for cos
+.equ p_region1, 0x0260
+
+.equ save_r12, 0x0270 # temporary for get/put bits operation
+.equ save_r13, 0x0280 # temporary for get/put bits operation
+
+.equ save_r14, 0x0290 # temporary for get/put bits operation
+.equ save_r15, 0x02A0 # temporary for get/put bits operation
+
+.equ save_xa, 0x02B0 # qword ; leave space for 4 args*****
+.equ save_ysa, 0x02C0 # qword ; leave space for 4 args*****
+.equ save_yca, 0x02D0 # qword ; leave space for 4 args*****
+
+.equ save_nv, 0x02E0 # qword
+.equ p_iter, 0x02F0 # qword storage for number of loop iterations
+
+
+.globl vrda_sincos
+ .type vrda_sincos,@function
+vrda_sincos:
+
+ sub $0x0308,%rsp
+
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ysa(%rsp) # save ysin_array pointer
+ mov %rcx,save_yca(%rsp) # save ycos_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+
+ mov %rdi,save_nv(%rsp) # save number of values
+ # see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrda_cleanup # jump if only single calls
+ # prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
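+
+# The array is processed four doubles per iteration; an equivalent C sketch
+# of the bookkeeping just done (variable names are illustrative):
+#
+#   unsigned long iter = n >> 2;           /* four-wide loop iterations        */
+#   unsigned long left = n - (iter << 2);  /* 0..3 elements for scalar cleanup */
+#   /* iter == 0 jumps straight to the scalar cleanup path */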
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrda_top:
+# build the input _m128d
+ movapd .L__real_7fffffffffffffff(%rip),%xmm2
+ mov .L__real_7fffffffffffffff(%rip),%rdx
+
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlpd (%rsi),%xmm0
+ movhpd 8(%rsi),%xmm0
+ mov (%rsi),%rax
+ mov 8(%rsi),%rcx
+ movdqa %xmm0,%xmm6
+ movdqa %xmm0,p_original(%rsp)
+
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movlpd -16(%rsi), %xmm1
+ movhpd -8(%rsi), %xmm1
+ mov -16(%rsi), %r8
+ mov -8(%rsi), %r9
+ movdqa %xmm1,%xmm7
+ movdqa %xmm1,p_original1(%rsp)
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+andpd %xmm2,%xmm0 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+and %rdx,%rax
+and %rdx,%rcx
+and %rdx,%r8
+and %rdx,%r9
+
+movdqa %xmm0,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm0,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm0,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5, xmm6 =x
+# xmm3 = x, xmm5 =0.5, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm0
+ mulpd %xmm0,%xmm2 # * twobypi
+ mulpd %xmm0,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+
+ xorpd %xmm12,%xmm12
+
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx # compare value for cossin path
+ mov %r8,%r10 # For Sign of Sin
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
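+
+# Consolidated C sketch of the extra-precision reduction done above for the
+# |x| < 5e5 path; piby2_1, piby2_2 and piby2_2tail are the three-piece
+# representation of pi/2 defined at the top of the file:
+#
+#   double npi2  = (double)(int)(x * twobypi + 0.5);
+#   double rhead = x - npi2 * piby2_1;
+#   double rtail = npi2 * piby2_2;
+#   double t     = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   /* r = rhead - rtail and rr = (rhead - r) - rtail are formed below */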
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm0 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+ pand .L__reald_one_one(%rip),%xmm4 #odd/even region for cos/sin
+ pand .L__reald_one_one(%rip),%xmm5 #odd/even region for cos/sin
+
+ pcmpeqd %xmm12,%xmm4
+ pcmpeqd %xmm12,%xmm5
+
+ punpckldq %xmm4,%xmm4
+ punpckldq %xmm5,%xmm5
+
+ movapd %xmm4,p_region(%rsp)
+ movapd %xmm5,p_region1(%rsp)
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_signs(%rsp) #write out lower sign bit
+ mov %r12,p_signs+8(%rsp) #write out upper sign bit
+ mov %r11,p_signs1(%rsp) #write out lower sign bit
+ mov %r13,p_signs1+8(%rsp) #write out upper sign bit
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm0 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm0,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm0 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ subpd %xmm0,%xmm6 #rr=rhead-r
+ subpd %xmm1,%xmm7 #rr=rhead-r
+
+ movapd %xmm0,%xmm2 #move r for r2
+ movapd %xmm1,%xmm3 #move r for r2
+
+ mulpd %xmm0,%xmm2 #r2
+ mulpd %xmm1,%xmm3 #r2
+
+ subpd %xmm8,%xmm6 #rr=(rhead-r) -rtail
+ subpd %xmm9,%xmm7 #rr=(rhead-r) -rtail
+
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ shr $1,%r8
+ shr $1,%r9
+
+ mov %r8,%r12
+ mov %r9,%r13
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r8 #shift lower sign bit left by 63 bits
+ shl $63,%r9 #shift lower sign bit left by 63 bits
+
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r8,p_signc(%rsp) #write out lower sign bit
+ mov %r12,p_signc+8(%rsp) #write out upper sign bit
+ mov %r9,p_signc1(%rsp) #write out lower sign bit
+ mov %r13,p_signc1+8(%rsp) #write out upper sign bit
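+
+# The gpr logic above derives the output signs from the quadrant npi2 and the
+# sign of the input; per element it is equivalent to this C sketch:
+#
+#   int neg_sin = ((npi2 >> 1) & 1) ^ (x < 0.0);  /* sin flips in quadrants 2,3
+#                                                    and with the input sign  */
+#   int neg_cos = ((npi2 + 1) >> 1) & 1;          /* cos flips in quadrants 1,2 */
+#   /* the shl $63 / shl $31 stores turn these bits into IEEE sign-bit masks
+#      written to p_signs / p_signc for use by the cleanup code */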
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsinsin_sinsin_piby4:
+
+ movapd %xmm0,p_temp(%rsp) # copy of x
+ movapd %xmm1,p_temp1(%rsp) # copy of x
+
+ movapd %xmm2,%xmm10 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x50(%rip),%xmm4 # s6
+ movdqa .Lsinarray+0x50(%rip),%xmm5 # s6
+ movapd .Lsinarray+0x20(%rip),%xmm8 # s3
+ movapd .Lsinarray+0x20(%rip),%xmm9 # s3
+
+ movdqa .Lcosarray+0x50(%rip),%xmm12 # c6
+ movdqa .Lcosarray+0x50(%rip),%xmm13 # c6
+ movapd .Lcosarray+0x20(%rip),%xmm14 # c3
+ movapd .Lcosarray+0x20(%rip),%xmm15 # c3
+
+ movapd %xmm2,p_temp2(%rsp) # copy of x2
+ movapd %xmm3,p_temp3(%rsp) # copy of x2
+
+ mulpd %xmm2,%xmm4 # s6*x2
+ mulpd %xmm3,%xmm5 # s6*x2
+ mulpd %xmm2,%xmm8 # s3*x2
+ mulpd %xmm3,%xmm9 # s3*x2
+
+ mulpd %xmm2,%xmm12 # s6*x2
+ mulpd %xmm3,%xmm13 # s6*x2
+ mulpd %xmm2,%xmm14 # s3*x2
+ mulpd %xmm3,%xmm15 # s3*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsinarray+0x40(%rip),%xmm4 # s5+x2s6
+ addpd .Lsinarray+0x40(%rip),%xmm5 # s5+x2s6
+ addpd .Lsinarray+0x10(%rip),%xmm8 # s2+x2C3
+ addpd .Lsinarray+0x10(%rip),%xmm9 # s2+x2C3
+
+ addpd .Lcosarray+0x40(%rip),%xmm12 # c5+x2c6
+ addpd .Lcosarray+0x40(%rip),%xmm13 # c5+x2c6
+ addpd .Lcosarray+0x10(%rip),%xmm14 # c2+x2C3
+ addpd .Lcosarray+0x10(%rip),%xmm15 # c2+x2C3
+
+ mulpd %xmm2,%xmm10 # x6
+ mulpd %xmm3,%xmm11 # x6
+
+ mulpd %xmm2,%xmm4 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm5 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm8 # x2(s2+x2C3)
+ mulpd %xmm3,%xmm9 # x2(s2+x2C3)
+
+ mulpd %xmm2,%xmm12 # x2(s5+x2s6)
+ mulpd %xmm3,%xmm13 # x2(s5+x2s6)
+ mulpd %xmm2,%xmm14 # x2(s2+x2C3)
+ mulpd %xmm3,%xmm15 # x2(s2+x2C3)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5 *x2
+
+ addpd .Lsinarray+0x30(%rip),%xmm4 # s4 + x2(s5+x2s6)
+ addpd .Lsinarray+0x30(%rip),%xmm5 # s4 + x2(s5+x2s6)
+ addpd .Lsinarray(%rip),%xmm8 # s1 + x2(s2+x2C3)
+ addpd .Lsinarray(%rip),%xmm9 # s1 + x2(s2+x2C3)
+
+ movapd %xmm2,p_temp4(%rsp) # copy of r
+ movapd %xmm3,p_temp5(%rsp) # copy of r
+
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm1 # r
+
+ addpd .Lcosarray+0x30(%rip),%xmm12 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray+0x30(%rip),%xmm13 # c4 + x2(c5+x2c6)
+ addpd .Lcosarray(%rip),%xmm14 # c1 + x2(c2+x2C3)
+ addpd .Lcosarray(%rip),%xmm15 # c1 + x2(c2+x2C3)
+
+ mulpd %xmm6,%xmm2 # 0.5 * x2 *xx
+ mulpd %xmm7,%xmm3 # 0.5 * x2 *xx
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ subpd .L__real_3ff0000000000000(%rip),%xmm1 # -t=r-1.0
+
+ mulpd %xmm10,%xmm4 # x6(s4 + x2(s5+x2s6))
+ mulpd %xmm11,%xmm5 # x6(s4 + x2(s5+x2s6))
+
+ mulpd %xmm10,%xmm12 # x6(c4 + x2(c5+x2c6))
+ mulpd %xmm11,%xmm13 # x6(c4 + x2(c5+x2c6))
+
+ addpd .L__real_3ff0000000000000(%rip),%xmm0 # 1+(-t)
+ addpd .L__real_3ff0000000000000(%rip),%xmm1 # 1+(-t)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ addpd %xmm14,%xmm12 # zc
+ addpd %xmm15,%xmm13 # zc
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = 0.5 * x2 *xx, xmm4 = zs, xmm12 = zc, xmm6 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = 0.5 * x2 *xx, xmm5 = zs, xmm13 = zc, xmm7 =rr
+
+# Free
+# %xmm8,,%xmm10 xmm14
+# %xmm9,,%xmm11 xmm15
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd p_temp2(%rsp),%xmm10 # x2 for x3
+ movapd p_temp3(%rsp),%xmm11 # x2 for x3
+
+ movapd %xmm10,%xmm8 # x2 for x4
+ movapd %xmm11,%xmm9 # x2 for x4
+
+ movapd p_temp(%rsp),%xmm14 # x for x*xx
+ movapd p_temp1(%rsp),%xmm15 # x for x*xx
+
+ subpd p_temp4(%rsp),%xmm0 # (1 + (-t)) - r
+ subpd p_temp5(%rsp),%xmm1 # (1 + (-t)) - r
+
+ mulpd %xmm14,%xmm10 # x3
+ mulpd %xmm15,%xmm11 # x3
+
+ mulpd %xmm8,%xmm8 # x4
+ mulpd %xmm9,%xmm9 # x4
+
+ mulpd %xmm6,%xmm14 # x*xx
+ mulpd %xmm7,%xmm15 # x*xx
+
+ mulpd %xmm10,%xmm4 # x3 * zs
+ mulpd %xmm11,%xmm5 # x3 * zs
+
+ mulpd %xmm8,%xmm12 # x4 * zc
+ mulpd %xmm9,%xmm13 # x4 * zc
+
+ subpd %xmm2,%xmm4 # x3*zs-0.5 * x2 *xx
+ subpd %xmm3,%xmm5 # x3*zs-0.5 * x2 *xx
+
+ subpd %xmm14,%xmm0 # ((1 + (-t)) - r) -x*xx
+ subpd %xmm15,%xmm1 # ((1 + (-t)) - r) -x*xx
+
+
+ movapd p_temp4(%rsp),%xmm10 # r for t
+ movapd p_temp5(%rsp),%xmm11 # r for t
+
+ addpd %xmm6,%xmm4 # sin+xx
+ addpd %xmm7,%xmm5 # sin+xx
+
+ addpd %xmm0,%xmm12 # x4*zc + (((1 + (-t)) - r) - x*xx)
+ addpd %xmm1,%xmm13 # x4*zc + (((1 + (-t)) - r) - x*xx)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm10 # -t=r-1.0
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+
+ movapd p_region(%rsp),%xmm2
+ movapd p_region1(%rsp),%xmm3
+
+ movapd %xmm2,%xmm8
+ movapd %xmm3,%xmm9
+
+ addpd p_temp(%rsp),%xmm4 # sin+xx+x
+ addpd p_temp1(%rsp),%xmm5 # sin+xx+x
+
+	subpd	%xmm10,%xmm12		# cos + t
+	subpd	%xmm11,%xmm13		# cos + t
+
+# xmm4 = sin, xmm5 = sin
+# xmm12 = cos, xmm13 = cos
+
+ andnpd %xmm4,%xmm8
+ andnpd %xmm5,%xmm9
+
+ andpd %xmm2,%xmm4
+ andpd %xmm3,%xmm5
+
+ andnpd %xmm12,%xmm2
+ andnpd %xmm13,%xmm3
+
+ andpd p_region(%rsp),%xmm12
+ andpd p_region1(%rsp),%xmm13
+
+ orpd %xmm2,%xmm4
+ orpd %xmm3,%xmm5
+
+ orpd %xmm8,%xmm12
+ orpd %xmm9,%xmm13
+
+ jmp .L__vrd4_sin_cleanup
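+
+# C sketch of the cos half of the kernel above (the sin half follows the same
+# pattern as the sin-only path); x and xx are the reduced argument and its
+# tail, c1..c6 the .Lcosarray coefficients, and t = 1 - 0.5*x2 is split into a
+# head and a rounding-error correction to keep extra precision:
+#
+#   double x2 = x*x, x4 = x2*x2, x6 = x4*x2;
+#   double zc = (c1 + x2*(c2 + x2*c3)) + x6*(c4 + x2*(c5 + x2*c6));
+#   double r  = 0.5 * x2;
+#   double t  = 1.0 - r;                  /* head of cos                  */
+#   double e  = (1.0 - t) - r;            /* rounding error of the head   */
+#   double cos_x = t + ((e - x*xx) + x4*zc);
+#
+# The andnpd/andpd/orpd sequence then routes the sin or cos result to each
+# lane according to p_region, so the cleanup code stores the right value into
+# the ys and yc arrays.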
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rax,%rcx,%r8,%r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm10, xmm12
+#     xmm9, xmm11, xmm13
+
+
+ movlpd %xmm0,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm0,%xmm0 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm0,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ movlpd r(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf:
+	mov	p_original(%rsp),%rax			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) # rr = 0
+ mov %r10d,region(%rsp) # region =0
+ and .L__real_naninf_lower_sign_mask(%rip),%r12 # Sign
+.align 16
+0:
+ jmp .Lcheck_next2_args
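+
+# For a NaN/Inf element the usual convention is followed: the exponent field
+# is tested against 0x7ff0000000000000, the quiet-NaN bit is OR-ed in, and
+# region/rr are zeroed so the polynomial path simply propagates the value.
+# Scalar C sketch of the intent (the union is only for exposition):
+#
+#   union { double d; unsigned long long u; } v = { .d = x };
+#   if ((v.u & 0x7ff0000000000000ULL) == 0x7ff0000000000000ULL) {
+#       v.u |= 0x0008000000000000ULL;    /* r = x | quiet-NaN bit */
+#       r = v.d;  rr = 0.0;  region = 0;
+#   }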
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rax,%rcx,%r8,%r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr(%rsp),%rsi
+ lea r(%rsp),%rdi
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov p_original(%rsp),%rax
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr(%rsp) #rr = 0
+ mov %r10d,region(%rsp) #region = 0
+ and .L__real_naninf_lower_sign_mask(%rip),%r12 # Sign
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5:
+ mov p_original+8(%rsp),%rcx #upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) #rr = 0
+ mov %r10d,region+4(%rsp) #region = 0
+ and .L__real_naninf_upper_sign_mask(%rip),%r12 # Sign
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rax,%rcx,%r8,%r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm10, xmm12
+#	  xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm0,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm0,%xmm0 #Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm2,%xmm2
+# movlhps %xmm6,%xmm6
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax				# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm10	# xmm10 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm10					# xmm10 = rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	 %eax,region(%rsp)			# store lower region
+ movsd %xmm6,%xmm0
+ subsd %xmm10,%xmm0 # xmm0 = r=(rhead-rtail)
+ subsd %xmm0,%xmm6 # rr=rhead-r
+ subsd %xmm10,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+	movlpd	 %xmm0,r(%rsp)				# store lower r
+	movlpd	 %xmm6,rr(%rsp)				# store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf
+
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr+8(%rsp),%rsi
+ lea r+8(%rsp),%rdi
+ movlpd r+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf:
+ mov p_original+8(%rsp),%rcx # upper arg is nan/inf
+# mov r+8(%rsp),%rcx ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr+8(%rsp) # rr = 0
+ mov %r10d,region+4(%rsp) # region =0
+ and .L__real_naninf_upper_sign_mask(%rip),%r12 # Sign
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x, xmm4 = 0.5
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm5,region1(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm1,%xmm7 # rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ subpd %xmm1,%xmm7 # rr=rhead-r
+ subpd %xmm9,%xmm7 # rr=(rhead-r) -rtail
+ movapd %xmm7,rr1(%rsp)
+
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rax,%rcx,%r8,%r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use 	xmm9, xmm11, xmm13
+# 	xmm5, xmm8, xmm10, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+#DEBUG
+# movapd %xmm2, %xmm4
+# movapd %xmm1, %xmm5
+# movapd %xmm2, %xmm12
+# movapd %xmm1, %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+#DEBUG
+# movapd region(%rsp), %xmm4
+# movapd %xmm1, %xmm5
+# movapd region(%rsp), %xmm12
+# movapd %xmm1, %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rax,%rcx,%r8,%r9
+# xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+#DEBUG
+# movapd region(%rsp), %xmm4
+# movapd %xmm1, %xmm5
+# movapd region(%rsp), %xmm12
+# movapd %xmm1, %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+ movlpd %xmm1,r1+8(%rsp) # store upper r
+ movlpd %xmm7,rr1+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movlpd r1(%rsp),%xmm0 #Restore lower fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_higher:
+	mov	p_original1(%rsp),%r8			# lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) # rr = 0
+ mov %r10d,region1(%rsp) # region =0
+ and .L__real_naninf_lower_sign_mask(%rip),%r13 # Sign
+
+.align 16
+0:
+
+
+#DEBUG
+# movapd r(%rsp), %xmm4
+# movapd r1(%rsp), %xmm5
+# movapd r(%rsp), %xmm12
+# movapd r1(%rsp), %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+
+ jmp .L__vrd4_sin_reconstruct
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea rr1(%rsp),%rsi
+ lea r1(%rsp),%rdi
+ movsd %xmm1,%xmm0
+ call __amd_remainder_piby2@PLT
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrd4_sin_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov p_original1(%rsp),%r8
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1(%rsp) #rr = 0
+ mov %r10d,region1(%rsp) #region = 0
+ and .L__real_naninf_lower_sign_mask(%rip),%r13 # Sign
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_of_both_gt_5e5_higher:
+ mov p_original1+8(%rsp),%r9 #upper arg is nan/inf
+# movd %xmm6,%r9 ;upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) #rr = 0
+ mov %r10d,region1+4(%rsp) #region = 0
+ and .L__real_naninf_upper_sign_mask(%rip),%r13 # Sign
+
+.align 16
+0:
+
+ jmp .L__vrd4_sin_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rax,%rcx,%r8,%r9
+#xmm0, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm0 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movq %xmm4,region(%rsp) # Region
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm0 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm0,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm0 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm0 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm0,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ movapd %xmm0,%xmm6 # rhead
+ subpd %xmm8,%xmm0 # r = rhead - rtail
+ movapd %xmm0,r(%rsp)
+
+ subpd %xmm0,%xmm6 # rr=rhead-r
+ subpd %xmm8,%xmm6 # rr=(rhead-r) -rtail
+ movapd %xmm6,rr(%rsp)
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+# movlhps %xmm1,%xmm1 #Not needed since we want to work on lower arg, but done just to be safe and avoid exceptions due to nan/inf and to mirror the lower_arg_gt_5e5 case
+# movlhps %xmm3,%xmm3
+# movlhps %xmm7,%xmm7
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm0 # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+ movsd %xmm7,%xmm1
+ subsd %xmm0,%xmm1 # xmm1 = r=(rhead-rtail)
+ subsd %xmm1,%xmm7 # rr=rhead-r
+ subsd %xmm0,%xmm7 # xmm7 = rr=((rhead-r) -rtail)
+
+ movlpd %xmm1,r1(%rsp) # store lower r
+ movlpd %xmm7,rr1(%rsp) # store lower rr
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrd4_sin_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea rr1+8(%rsp),%rsi
+ lea r1+8(%rsp),%rdi
+ movlpd r1+8(%rsp),%xmm0 #Restore upper fp arg for remainder_piby2 call
+ call __amd_remainder_piby2@PLT
+ jmp 0f
+
+.L__vrd4_sin_upper_naninf_higher:
+ mov p_original1+8(%rsp),%r9 # upper arg is nan/inf
+# mov r1+8(%rsp),%r9 ; upper arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ xor %r10,%r10
+ mov %r10,rr1+8(%rsp) # rr = 0
+ mov %r10d,region1+4(%rsp) # region =0
+ and .L__real_naninf_upper_sign_mask(%rip),%r13 # Sign
+
+.align 16
+0:
+ jmp .L__vrd4_sin_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm0 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#DEBUG
+# movapd region(%rsp), %xmm4
+# movapd region1(%rsp), %xmm5
+# movapd region(%rsp), %xmm12
+# movapd region1(%rsp), %xmm13
+# jmp .L__vrd4_sin_cleanup
+#DEBUG
+
+
+ movapd r(%rsp),%xmm0
+ movapd r1(%rsp),%xmm1
+
+ movapd rr(%rsp),%xmm6
+ movapd rr1(%rsp),%xmm7
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+
+ movlpd region(%rsp),%xmm4
+ movlpd region1(%rsp),%xmm5
+
+ pand .L__reald_one_one(%rip),%xmm4 #odd/even region for cos/sin
+ pand .L__reald_one_one(%rip),%xmm5 #odd/even region for cos/sin
+
+ xorpd %xmm12,%xmm12
+ pcmpeqd %xmm12,%xmm4
+ pcmpeqd %xmm12,%xmm5
+
+ punpckldq %xmm4,%xmm4
+ punpckldq %xmm5,%xmm5
+
+ movapd %xmm4,p_region(%rsp)
+ movapd %xmm5,p_region1(%rsp)
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_signs(%rsp) #write out lower sign bit
+ mov %r12,p_signs+8(%rsp) #write out upper sign bit
+ mov %r11,p_signs1(%rsp) #write out lower sign bit
+ mov %r13,p_signs1+8(%rsp) #write out upper sign bit
+
+ movapd %xmm0,%xmm2 # r
+ movapd %xmm1,%xmm3 # r
+
+ mulpd %xmm0,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ add .L__reald_one_one(%rip),%r8
+ add .L__reald_one_one(%rip),%r9
+
+ and .L__reald_two_two(%rip),%r8
+ and .L__reald_two_two(%rip),%r9
+
+ shr $1,%r8
+ shr $1,%r9
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r8 #shift lower sign bit left by 63 bits
+ shl $63,%r9 #shift lower sign bit left by 63 bits
+
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+
+ mov %r8,p_signc(%rsp) #write out lower sign bit
+ mov %rax,p_signc+8(%rsp) #write out upper sign bit
+ mov %r9,p_signc1(%rsp) #write out lower sign bit
+ mov %rcx,p_signc1+8(%rsp) #write out upper sign bit
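+# Per lane, the sign words written above reduce to the following sketch
+# (A = sign bit of the original argument, as tracked in r12/r13; n = the
+# region count npi2 for that lane):
+#
+#   uint64_t sin_neg = (A ^ (n >> 1)) & 1;   /* sin flips with A and bit 1 of n */
+#   uint64_t cos_neg = ((n + 1) >> 1) & 1;   /* cos negative for n mod 4 = 1, 2 */
+#
+# The lower lane's bit is shifted to bit 63 of the stored quadword and the
+# upper lane's bit into the sign position of the upper double, hence the
+# shifts by 63 and 31 above.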
+
+ jmp .Lsinsin_sinsin_piby4
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrd4_sin_cleanup:
+
+ xorpd p_signs(%rsp),%xmm4 # (+) Sign
+ xorpd p_signs1(%rsp),%xmm5 # (+) Sign
+
+ xorpd p_signc(%rsp),%xmm12 # (+) Sign
+ xorpd p_signc1(%rsp),%xmm13 # (+) Sign
+
+.L__vrda_bottom1:
+# store the result _m128d
+ mov save_ysa(%rsp),%rdi # get ysin_array pointer
+ mov save_yca(%rsp),%rbx # get ycos_array pointer
+
+ movlpd %xmm4,(%rdi)
+ movhpd %xmm4,8(%rdi)
+
+ movlpd %xmm12,(%rbx)
+ movhpd %xmm12,8(%rbx)
+
+.L__vrda_bottom2:
+
+ prefetch 64(%rdi)
+ prefetch 64(%rbx)
+
+ add $32,%rdi
+ add $32,%rbx
+
+ mov %rdi,save_ysa(%rsp) # save ysin_array pointer
+ mov %rbx,save_yca(%rsp) # save ycos_array pointer
+
+# store the result _m128d
+ movlpd %xmm5, -16(%rdi)
+ movhpd %xmm5, -8(%rdi)
+
+ movlpd %xmm13, -16(%rbx)
+ movhpd %xmm13, -8(%rbx)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrda_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrda_cleanup
+
+.L__final_check:
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+ mov save_rbx(%rsp),%rbx # restore rbx
+
+ add $0x0308,%rsp
+ ret
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when there are leftover values (fewer than four) to compute at the end.
+# save_xa points at the remaining x elements and save_ysa/save_yca at the
+# corresponding output slots; the number of values left is in save_nv.
+
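+# A rough C equivalent of the tail handling below (sketch only; the argument
+# order n, x, ys, yc matches the register setup used for the recursive call):
+#
+#   double tmp_x[4] = { 0.0, 0.0, 0.0, 0.0 }, tmp_ys[4], tmp_yc[4];
+#   for (int i = 0; i < nleft; i++)          /* nleft is 1, 2 or 3 here      */
+#       tmp_x[i] = x[i];
+#   vrda_sincos(4, tmp_x, tmp_ys, tmp_yc);   /* compute a full group of four */
+#   for (int i = 0; i < nleft; i++) {        /* copy back only the live ones */
+#       ys[i] = tmp_ys[i];
+#       yc[i] = tmp_yc[i];
+#   }
+#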
+.align 16
+.L__vrda_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorpd %xmm0,%xmm0
+ movlpd %xmm0,p_temp+8(%rsp)
+ movapd %xmm0,p_temp+16(%rsp)
+
+ mov (%rsi),%rcx # we know there's at least one
+ mov %rcx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrdacg
+
+ mov 8(%rsi),%rcx # do the second value
+ mov %rcx,p_temp+8(%rsp)
+ cmp $3,%rax
+ jl .L__vrdacg
+
+ mov 16(%rsi),%rcx # do the third value
+ mov %rcx,p_temp+16(%rsp)
+
+.L__vrdacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &ys parameter
+ lea p_temp4(%rsp),%rcx # &yc parameter
+
+ call vrda_sincos@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ysa(%rsp),%rdi
+ mov save_yca(%rsp),%rbx
+ mov save_nv(%rsp),%rax # get number of values
+
+ mov p_temp2(%rsp),%rcx
+ mov %rcx,(%rdi) # we know there's at least one
+ mov p_temp4(%rsp),%rdx
+ mov %rdx,(%rbx) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+8(%rsp),%rcx
+ mov %rcx,8(%rdi) # do the second value
+ mov p_temp4+8(%rsp),%rdx
+ mov %rdx,8(%rbx) # do the second value
+ cmp $3,%rax
+ jl .L__vrdacgf
+
+ mov p_temp2+16(%rsp),%rcx
+ mov %rcx,16(%rdi) # do the third value
+ mov p_temp4+16(%rsp),%rdx
+ mov %rdx,16(%rbx) # do the third value
+
+.L__vrdacgf:
+ jmp .L__final_check
diff --git a/src/gas/vrs4cosf.S b/src/gas/vrs4cosf.S
new file mode 100644
index 0000000..ab59058
--- /dev/null
+++ b/src/gas/vrs4cosf.S
@@ -0,0 +1,2122 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4_cosf.s
+#
+# A vector implementation of the cosf libm function.
+#
+# Prototype:
+#
+# __m128 __vrs4_cosf(__m128 x);
+#
+# Computes the cosine of x for four single-precision values at a time.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# The four input values are passed as packed singles in xmm0 and the
+# four results are returned as packed singles in xmm0.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning 4 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory and retrieving
+# the results from memory. This routine eliminates that overhead when the
+# data does not already reside in memory.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
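+# A usage sketch from C (illustrative assumption, not part of this file: the
+# cosf4 wrapper below is hypothetical, only the __vrs4_cosf prototype is
+# provided by this source):
+#
+#   #include <xmmintrin.h>
+#
+#   __m128 __vrs4_cosf(__m128 x);
+#
+#   static void cosf4(const float *in, float *out)
+#   {
+#       __m128 v = _mm_loadu_ps(in);    /* load four packed singles  */
+#       v = __vrs4_cosf(v);             /* four cosines in one call  */
+#       _mm_storeu_ps(out, v);          /* store the four results    */
+#   }
+#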
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+
+.align 64
+.Levencos_oddsin_tbl:
+
+ .quad .Lcoscos_coscos_piby4 # 0 * ; Done
+ .quad .Lcoscos_cossin_piby4 # 1 + ; Done
+ .quad .Lcoscos_sincos_piby4 # 2 ; Done
+ .quad .Lcoscos_sinsin_piby4 # 3 + ; Done
+
+ .quad .Lcossin_coscos_piby4 # 4 ; Done
+ .quad .Lcossin_cossin_piby4 # 5 * ; Done
+ .quad .Lcossin_sincos_piby4 # 6 ; Done
+ .quad .Lcossin_sinsin_piby4 # 7 ; Done
+
+ .quad .Lsincos_coscos_piby4 # 8 ; Done
+ .quad .Lsincos_cossin_piby4 # 9 ; TBD
+ .quad .Lsincos_sincos_piby4 # 10 * ; Done
+ .quad .Lsincos_sinsin_piby4 # 11 ; Done
+
+ .quad .Lsinsin_coscos_piby4 # 12 ; Done
+ .quad .Lsinsin_cossin_piby4 # 13 + ; Done
+ .quad .Lsinsin_sincos_piby4 # 14 ; Done
+ .quad .Lsinsin_sinsin_piby4 # 15 * ; Done
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # save area for xmm6
+.equ save_xmm7,0x30 # save area for xmm7
+.equ save_xmm8,0x40 # save area for xmm8
+.equ save_xmm9,0x50 # save area for xmm9
+.equ save_xmm0,0x60 # save area for xmm0
+.equ save_xmm11,0x70 # save area for xmm11
+.equ save_xmm12,0x80 # save area for xmm12
+.equ save_xmm13,0x90 # save area for xmm13
+.equ save_xmm14,0x0A0 # save area for xmm14
+.equ save_xmm15,0x0B0 # save area for xmm15
+
+.equ r,0x0C0 # storage for r passed to remainder_piby2
+.equ rr,0x0D0 # storage for rr passed to remainder_piby2
+.equ region,0x0E0 # storage for region passed to remainder_piby2
+
+.equ r1,0x0F0 # storage for r1 passed to remainder_piby2
+.equ rr1,0x0100 # storage for rr1 passed to remainder_piby2
+.equ region1,0x0110 # storage for region1 passed to remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # mask
+.equ p_sign,0x0180 # sign
+
+.equ p_original1,0x0190 # original x (second pair)
+.equ p_mask1,0x01A0 # mask (second pair)
+.equ p_sign1,0x01B0 # sign (second pair)
+
+
+.equ save_r12,0x01C0 # save area for r12
+.equ save_r13,0x01D0 # save area for r13
+
+
+.globl __vrs4_cosf
+ .type __vrs4_cosf,@function
+__vrs4_cosf:
+ sub $0x01E8,%rsp
+
+#DEBUG
+# mov %r12,save_r12(%rsp) # save r12
+# mov %r13,save_r13(%rsp) # save r13
+
+# mov save_r12(%rsp),%r12 # restore r12
+# mov save_r13(%rsp),%r13 # restore r13
+
+# add $0x01E8,%rsp
+# ret
+#DEBUG
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ xor %rax,%r10
+ xor %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
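+# In scalar terms, the per-lane sign selection above is roughly (sketch; n is
+# the npi2 region count for one lane):
+#
+#   uint64_t neg  = (n ^ (n >> 1)) & 1;   /* set for n mod 4 = 1 or 2, where the result is negated */
+#   uint64_t mask = neg << 63;            /* IEEE-754 double sign bit                              */
+#
+# The packed code builds this for two lanes at once, which is why the upper
+# lane's bit is isolated with .L__reald_one_zero and shifted by 31 so that it
+# lands in the upper double's sign position.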
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
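+# Sketch of the dispatch index built above: bit 0 of each lane's region picks
+# sin (odd) or cos (even) of the reduced argument, and the four lane bits are
+# packed into a single 4-bit table index:
+#
+#   idx =  odd0          /* lane 0, first pair   -> bit 0 */
+#       | (odd1 << 1)    /* lane 1, first pair   -> bit 1 */
+#       | (odd2 << 2)    /* lane 0, second pair  -> bit 2 */
+#       | (odd3 << 3);   /* lane 1, second pair  -> bit 3 */
+#   /* followed by an indirect jump through .Levencos_oddsin_tbl[idx] */
+#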
+ leaq .Levencos_oddsin_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx, %rax, %r8, %r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+# Be sure not to use %xmm3, %xmm1 and %xmm7
+# Use %xmm8, %xmm5, %xmm0, %xmm12
+# %xmm11, %xmm9, %xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ subsd %xmm10,%xmm6 # rr=rhead-r
+ subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+# Do not use %xmm3, %xmm1, %xmm7
+# Restore xmm4 and %xmm3, %xmm1, %xmm7
+# Can use %xmm0, %xmm8, %xmm12
+# %xmm9, %xmm5, %xmm11, %xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm3 = piby2_1
+ cvttsd2si %xmm2,%eax # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm1 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm7 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm0 # xmm1 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+
+ subsd %xmm0,%xmm6 # xmm6 = r=(rhead-rtail)
+
+ movlpd %xmm6,r(%rsp) # store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ jmp .L__vrs4_cosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+# Do not use %xmm3, %xmm1, %xmm7
+# Can use %xmm11, %xmm9, %xmm13
+# %xmm8, %xmm5, %xmm0, %xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3, %xmm1, %xmm7 = x, %xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+ subsd %xmm10,%xmm7 # xmm1 = r=(rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_cosf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3, %xmm1, %xmm7 = x, %xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+ jmp .L__vrs4_cosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# %rcx, %rax, %r8, %r9
+# %xmm2, %xmm10, %xmm6 = x, %xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3, %xmm1, %xmm7 = x, %xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+ subsd %xmm10,%xmm7 # xmm7 = r=(rhead-rtail)
+
+ movlpd %xmm7,r1(%rsp) # store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_cosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_cosf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ xor %rax,%r10
+ xor %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_cosf_cleanup:
+
+ movapd p_sign(%rsp),%xmm10
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm10 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+ movlhps %xmm11,%xmm0
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x01E8,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm0,%xmm4 # + t
+ subpd %xmm11,%xmm5 # + t
+
+ jmp .L__vrs4_cosf_cleanup
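+# The core approximation used above, as a sketch (c1..c4 are the .Lcosarray
+# coefficients):
+#
+#   double x2 = r * r;
+#   double x4 = x2 * x2;
+#   double t  = 1.0 - 0.5 * x2;
+#   double zc = (c1 + x2 * c2) + x4 * (c3 + x2 * c4);
+#   double cos_r = t + x4 * zc;
+#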
+
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm10,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ addsd %xmm10,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm13,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4:
+
+ movapd .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm3,%xmm7 # sincos term upper x2 for x3
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm1,%xmm7
+
+ mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ addsd %xmm10,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm6 # move x2 for x4
+ movapd %xmm3,%xmm7 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s4+x2s3)
+ mulpd %xmm11,%xmm5 # x4(s4+x2s3)
+
+ mulpd %xmm10,%xmm6 # get low x3 for sin term
+ mulpd %xmm1,%xmm7 # get low x3 for sin term
+ movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term
+ movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm4,%xmm12 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm13 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm12 # sin *x3
+ mulsd %xmm7,%xmm13 # sin *x3
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ movhlps %xmm10,%xmm0 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ addsd %xmm0,%xmm12 # sin + x
+ addsd %xmm11,%xmm13 # sin + x
+
+ subsd %xmm2,%xmm4 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm12,%xmm4
+ movlhps %xmm13,%xmm5
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos)
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm10,%xmm7
+
+ mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm3,%xmm12 # move high r for cos (cossin)
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos)
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin)
+
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm6,%xmm5 # sin *x3
+ mulsd %xmm7,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+
+ movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos)
+
+ subsd %xmm2,%xmm4 # cos-(-t)
+ subsd %xmm12,%xmm9 # cos-(-t)
+
+ addsd %xmm11,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrs4_cosf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: SIN
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr: COS
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; SIN
+ movapd %xmm3,%xmm11 # x2 ; COS
+ movapd %xmm3,%xmm1 # copy of x2 for x4
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm0 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm3,%xmm1 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm1,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm1,%xmm5 # x4 * zc
+
+ addpd %xmm10,%xmm4 # +x
+ subpd %xmm11,%xmm5 # +t
+
+ jmp .L__vrs4_cosf_cleanup
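+# The sin half above follows the matching sketch (s1..s4 are the .Lsinarray
+# coefficients):
+#
+#   double x2 = r * r;
+#   double x4 = x2 * x2;
+#   double x3 = x2 * r;
+#   double zs = (s1 + x2 * s2) + x4 * (s3 + x2 * s4);
+#   double sin_r = r + x3 * zs;
+#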
+
+.align 16
+.Lsinsin_coscos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: COS
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr: SIN
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; COS
+ movapd %xmm3,%xmm11 # x2 ; SIN
+ movapd %xmm2,%xmm10 # copy of x2 for x4
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # s4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # s2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # s4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # s2*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+	addpd	.Lsinarray+0x20(%rip),%xmm5			# s3+x2s4
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+	addpd	.Lsinarray(%rip),%xmm9				# s1+x2s2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm10,%xmm4 # x4 * zc
+	mulpd	%xmm3,%xmm5					# x3 * zs
+
+ subpd %xmm0,%xmm4 # +t
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrs4_cosf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos
+ movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+	addpd	.Lsincosarray+0x20(%rip),%xmm4			# c3+x2c4
+	addpd	.Lcosarray+0x20(%rip),%xmm5			# c3+x2c4
+	addpd	.Lsincosarray(%rip),%xmm8			# c1+x2c2
+	addpd	.Lcosarray(%rip),%xmm9				# c1+x2c2
+
+ movapd %xmm12,%xmm2 # upper=x4
+ movsd %xmm6,%xmm2 # lower=x2
+ mulsd %xmm10,%xmm2 # lower=x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # upper= x4 * zc
+ # lower=x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+ movlhps %xmm7,%xmm10 #
+ addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from cossin_coscos
+ movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+	addpd	.Lcossinarray+0x20(%rip),%xmm4			# c3+x2c4
+	addpd	.Lcosarray+0x20(%rip),%xmm5			# c3+x2c4
+	addpd	.Lcossinarray(%rip),%xmm8			# c1+x2c2
+	addpd	.Lcosarray(%rip),%xmm9				# c1+x2c2
+
+ mulpd %xmm10,%xmm2 # upper=x3 for sin
+ mulsd %xmm10,%xmm2 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # lower= x4 * zc
+ # upper= x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+ movsd %xmm7,%xmm10
+ addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd %xmm3,%xmm6 # lower x2 for x3 for sin
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+	addpd	.Lcosarray+0x20(%rip),%xmm4			# c3+x2c4
+	addpd	.Lsincosarray+0x20(%rip),%xmm5			# c3+x2c4
+	addpd	.Lcosarray(%rip),%xmm8				# c1+x2c2
+	addpd	.Lsincosarray(%rip),%xmm9			# c1+x2c2
+
+ movapd %xmm13,%xmm3 # upper=x4
+ movsd %xmm6,%xmm3 # lower x2
+ mulsd %xmm1,%xmm3 # lower x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # upper= x4 * zc
+ # lower=x3 * zs
+
+ movlhps %xmm7,%xmm1
+ addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm4 # -(-t)
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm3,%xmm0 # x2
+ movapd %xmm3,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ movapd %xmm13,%xmm3 # upper x4 for cos
+ movsd %xmm7,%xmm3 # lower x2 for sin
+ mulsd %xmm1,%xmm3 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +t upper, +x lower
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+	addpd	.Lcosarray+0x20(%rip),%xmm4			# c3+x2c4
+	addpd	.Lcossinarray+0x20(%rip),%xmm5			# c3+x2c4
+	addpd	.Lcosarray(%rip),%xmm8				# c1+x2c2
+	addpd	.Lcossinarray(%rip),%xmm9			# c1+x2c2
+
+ mulpd %xmm1,%xmm3 # upper=x3 for sin
+ mulsd %xmm1,%xmm3 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower= x4 * zc
+ # upper= x3 * zs
+
+ movsd %xmm7,%xmm1
+ subpd %xmm11,%xmm4 # -(-t)
+ addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos
+
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+
+ movsd %xmm3,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm1,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11 # upper =t ; lower =x
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm11,%xmm5 # +t lower, +x upper
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ movapd %xmm12,%xmm2 # upper x4 for cos
+ movsd %xmm7,%xmm2 # lower x2 for sin
+ mulsd %xmm10,%xmm2 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm10,%xmm4 # +t upper, +x lower
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movsd %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm10,%xmm2 # upper x3 for sin
+ mulsd %xmm10,%xmm2 # lower x4 for cos
+
+ movhlps %xmm10,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm11,%xmm4 # +t lower, +x upper
+
+ jmp .L__vrs4_cosf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ #x2 = x * x;
+ #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+
+ #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4));
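+	# Note (added): the sum is split as (c1 + x2*c2) + x4*(c3 + x2*c4) rather
+	# than one long Horner chain, presumably so the two halves can be computed
+	# in independent register chains (xmm8/xmm9 and xmm4/xmm5) before merging.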
+
+
+ movapd %xmm2,%xmm0 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # x3
+
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrs4_cosf_cleanup
diff --git a/src/gas/vrs4expf.S b/src/gas/vrs4expf.S
new file mode 100644
index 0000000..b0e23aa
--- /dev/null
+++ b/src/gas/vrs4expf.S
@@ -0,0 +1,410 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# __vrs4_expf.s
+#
+# A vector implementation of the expf libm function.
+# This routine is implemented in single precision.  It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_expf(__m128 x);
+#
+# Computes e raised to the x power for 4 floats at a time.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+#
+#
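+# Informal outline of the computation (scalar view; added comment, the
+# symbols follow the labels defined below):
+#
+#   n = rint(x * 32/ln(2));   j = n & 0x1f;   m = (n - j) / 32;
+#   r = (x - n*log2_by_32_lead) + (- n*log2_by_32_tail);     /* r1 + r2 */
+#   q = r + r^2/2 + r^3/6 + r^4/24;
+#   expf(x) ~= 2^m * two_to_jby32_table[j] * (1 + q)
+#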
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_ux,0x10 # local storage for ux array
+.equ p_m,0x20 # local storage for m array
+.equ	p_j,0x30		# local storage for j array
+.equ save_rbx,0x040 #qword
+.equ stack_size,0x48
+
+
+
+.globl __vrs4_expf
+ .type __vrs4_expf,@function
+__vrs4_expf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+
+ movaps %xmm0,p_ux(%rsp)
+ maxps .L__real_m8192(%rip),%xmm0 # protect against small input values
+
+
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm2 #
+ mulps %xmm0,%xmm2
+ xor %rax,%rax
+ minps .L__real_8192(%rip),%xmm2 # protect against large input values
+
+# /* Set n = nearest integer to r */
+ cvtps2dq %xmm2,%xmm3
+ lea .L__two_to_jby32_table(%rip),%rdi
+ cvtdq2ps %xmm3,%xmm1
+
+
+# r1 = x - n * logbaseof2_by_32_lead;
+ movaps .L__real_log2_by_32_head(%rip),%xmm2
+ mulps %xmm1,%xmm2
+ subps %xmm2,%xmm0 # r1 in xmm0,
+
+# r2 = - n * logbaseof2_by_32_tail;
+ mulps .L__real_log2_by_32_tail(%rip),%xmm1
+
+# j = n & 0x0000001f;
+ movdqa %xmm3,%xmm4
+ movdqa .L__int_mask_1f(%rip),%xmm2
+ pand %xmm4,%xmm2
+ movdqa %xmm2,p_j(%rsp)
+# f1 = two_to_jby32_lead_table[j];
+
+# *m = (n - j) / 32;
+ psubd %xmm2,%xmm4
+ psrad $5,%xmm4
+ movdqa %xmm4,p_m(%rsp)
+
+ movaps %xmm0,%xmm3
+ addps %xmm1,%xmm3
+
+ mov p_j(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j(%rsp) # save the f1 value
+
+# Step 2. Compute the polynomial.
+# q = r1 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 )))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+# q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ movaps %xmm3,%xmm4
+ movaps %xmm3,%xmm2 # x*x
+ mulps %xmm2,%xmm2
+ mulps .L__real_1_24(%rip),%xmm4 # /24
+
+ mov p_j+4(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+4(%rsp) # save the f1 value
+
+ addps .L__real_1_6(%rip),%xmm4 # +1/6
+
+ mulps %xmm2,%xmm3 # x^3
+ mov p_j+8(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+8(%rsp) # save the f1 value
+ mulps .L__real_half(%rip),%xmm2 # x^2/2
+ mulps %xmm3,%xmm4 # *x^3
+
+ mov p_j+12(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+12(%rsp) # save the f1 value
+ addps %xmm4,%xmm1 # +r2
+
+ addps %xmm2,%xmm1 # + x^2/2
+ addps %xmm1,%xmm0 # +r1
+
+# deal with infinite or denormal results
+ movdqa p_m(%rsp),%xmm1
+ movdqa p_m(%rsp),%xmm2
+ pcmpgtd .L__int_127(%rip),%xmm2
+ pminsw .L__int_128(%rip),%xmm1 # ceil at 128
+ movmskps %xmm2,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm1 # add bias
+
+# *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j(%rsp),%xmm0 # * f1
+ addps p_j(%rsp),%xmm0 # + f1
+ jnz .L__exp_largef
+.L__check1:
+
+ pxor %xmm2,%xmm2 # floor at 0
+ pmaxsw %xmm2,%xmm1
+
+ pslld $23,%xmm1 # build 2^n
+
+ movaps %xmm1,%xmm2
+
+
+# check for infinity or nan
+ movaps p_ux(%rsp),%xmm1
+ andps .L__real_infinity(%rip),%xmm1
+ cmpps $0,.L__real_infinity(%rip),%xmm1
+ movmskps %xmm1,%eax
+ test $0xf,%eax
+
+
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# Step 3. Reconstitute.
+
+ mulps %xmm2,%xmm0 # result *= 2^n
+
+# we'd like to avoid a branch, and can use cmp's and and's to
+# eliminate them. But it adds cycles for normal cases
+# to handle events that are supposed to be exceptions.
+# Using this branch with the
+# check above results in faster code for the normal cases.
+# And branch mispredict penalties should only come into
+# play for nans and infinities.
+ jnz .L__exp_naninf
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# deal with nans and infinities
+
+.L__exp_naninf:
+ movaps %xmm0,p_temp(%rsp) # save the computed values
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .__Lni2
+ mov p_ux(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp(%rsp) # save the new result
+.__Lni2:
+	test		$2,%ecx		# second value?
+ jz .__Lni3
+ mov p_ux+4(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp+4(%rsp) # save the new result
+.__Lni3:
+	test		$4,%ecx		# third value?
+ jz .__Lni4
+ mov p_ux+8(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp+8(%rsp) # save the new result
+.__Lni4:
+	test		$8,%ecx		# fourth value?
+ jz .__Lnie
+ mov p_ux+12(%rsp),%edx # get the input
+ call .L__naninf
+ mov %edx,p_temp+12(%rsp) # save the new result
+.__Lnie:
+ movaps p_temp(%rsp),%xmm0 # get the answers
+ jmp .L__final_check
+
+
+# a simple subroutine to check a scalar input value for infinity
+# or NaN and return the correct result
+# expects input in edx, and returns value in edx. Destroys eax.
+.L__naninf:
+ mov $0x0007FFFFF,%eax
+ test %eax,%edx
+ jnz .L__enan # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %edx,%eax
+ rcl $1,%eax
+ jnc .L__r # exp(+inf) = inf
+ xor %edx,%edx # exp(-inf) = 0
+ jmp .L__r
+
+#NaN
+.L__enan:
+ mov $0x000400000,%eax # convert to quiet
+ or %eax,%edx
+.L__r:
+ ret
+
+ .align 16
+# deal with m > 127. In some instances, rounding during calculations
+# can result in infinity when it shouldn't. For these cases, we scale
+# m down, and scale the mantissa up.
+
+.L__exp_largef:
+ movdqa %xmm0,p_temp(%rsp) # save the mantissa portion
+ movdqa %xmm1,p_m(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf2
+ mov p_m(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m(%rsp) # save the exponent
+ movss p_temp(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_temp(%rsp) # save the mantissa
+.L__Lf2:
+ test $2,%ecx # second value?
+ jz .L__Lf3
+ mov p_m+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+4(%rsp) # save the exponent
+ movss p_temp+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_temp+4(%rsp) # save the mantissa
+.L__Lf3:
+ test $4,%ecx # third value?
+ jz .L__Lf4
+ mov p_m+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+8(%rsp) # save the exponent
+ movss p_temp+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_temp+8(%rsp) # save the mantissa
+.L__Lf4:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe
+ mov p_m+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+12(%rsp) # save the exponent
+ movss p_temp+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_temp+12(%rsp) # save the mantissa
+.L__Lfe:
+ movaps p_temp(%rsp),%xmm0 # restore the mantissa portion back
+ movdqa p_m(%rsp),%xmm1 # restore the exponent portion
+ jmp .L__check1
+
+ .data
+ .align 64
+
+.L__real_half: .long 0x3f000000 # 1/2
+ .long 0x3f000000
+ .long 0x3f000000
+ .long 0x3f000000
+
+.L__real_two: .long 0x40000000 # 2
+ .long 0x40000000
+ .long 0x40000000
+ .long 0x40000000
+
+.L__real_8192: .long 0x46000000 # 8192, to protect against really large numbers
+ .long 0x46000000
+ .long 0x46000000
+ .long 0x46000000
+.L__real_m8192: .long 0xC6000000 # -8192, to protect against really small numbers
+ .long 0xC6000000
+ .long 0xC6000000
+ .long 0xC6000000
+
+.L__real_thirtytwo_by_log2: .long 0x4238AA3B # thirtytwo_by_log2
+ .long 0x4238AA3B
+ .long 0x4238AA3B
+ .long 0x4238AA3B
+
+.L__real_log2_by_32: .long 0x3CB17218 # log2_by_32
+ .long 0x3CB17218
+ .long 0x3CB17218
+ .long 0x3CB17218
+
+.L__real_log2_by_32_head: .long 0x3CB17000 # log2_by_32
+ .long 0x3CB17000
+ .long 0x3CB17000
+ .long 0x3CB17000
+
+.L__real_log2_by_32_tail: .long 0xB585FDF4 # log2_by_32
+ .long 0xB585FDF4
+ .long 0xB585FDF4
+ .long 0xB585FDF4
+
+.L__real_1_6: .long 0x3E2AAAAB # 0.16666666666 used in polynomial
+ .long 0x3E2AAAAB
+ .long 0x3E2AAAAB
+ .long 0x3E2AAAAB
+
+.L__real_1_24: .long 0x3D2AAAAB # 0.041666668 used in polynomial
+ .long 0x3D2AAAAB
+ .long 0x3D2AAAAB
+ .long 0x3D2AAAAB
+
+.L__real_infinity: .long 0x7f800000 # infinity
+ .long 0x7f800000
+ .long 0x7f800000
+ .long 0x7f800000
+.L__int_mask_1f: .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+.L__int_128: .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+.L__int_127: .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+
+
+.L__two_to_jby32_table:
+ .long 0x3F800000 # 1
+ .long 0x3F82CD87 # 1.0218972
+ .long 0x3F85AAC3 # 1.0442737
+ .long 0x3F88980F # 1.0671405
+ .long 0x3F8B95C2 # 1.0905077
+ .long 0x3F8EA43A # 1.1143868
+ .long 0x3F91C3D3 # 1.1387886
+ .long 0x3F94F4F0 # 1.1637249
+ .long 0x3F9837F0 # 1.1892071
+ .long 0x3F9B8D3A # 1.2152474
+ .long 0x3F9EF532 # 1.2418578
+ .long 0x3FA27043 # 1.269051
+ .long 0x3FA5FED7 # 1.2968396
+ .long 0x3FA9A15B # 1.3252367
+ .long 0x3FAD583F # 1.3542556
+ .long 0x3FB123F6 # 1.3839099
+ .long 0x3FB504F3 # 1.4142135
+ .long 0x3FB8FBAF # 1.4451808
+ .long 0x3FBD08A4 # 1.4768262
+ .long 0x3FC12C4D # 1.5091645
+ .long 0x3FC5672A # 1.5422108
+ .long 0x3FC9B9BE # 1.5759809
+ .long 0x3FCE248C # 1.6104903
+ .long 0x3FD2A81E # 1.6457555
+ .long 0x3FD744FD # 1.6817929
+ .long 0x3FDBFBB8 # 1.7186193
+ .long 0x3FE0CCDF # 1.7562522
+ .long 0x3FE5B907 # 1.7947091
+ .long 0x3FEAC0C7 # 1.8340081
+ .long 0x3FEFE4BA # 1.8741677
+ .long 0x3FF5257D # 1.9152066
+ .long 0x3FFA83B3 # 1.9571441
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs4log10f.S b/src/gas/vrs4log10f.S
new file mode 100644
index 0000000..d6d9ac8
--- /dev/null
+++ b/src/gas/vrs4log10f.S
@@ -0,0 +1,646 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4log10f.s
+#
+# A vector implementation of the log10f libm function.
+# This routine is implemented in single precision.  It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_log10f(__m128 x);
+#
+# Computes the base-10 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+#
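+# Informal outline of the computation (scalar view; added comment, the
+# symbols follow the labels and tables defined below):
+#
+#   x  = 2^xexp * f;  f1 = index/128 from the lookup tables, f2 = f - f1;
+#   r1 = xexp*log2_lead + ln(f1)_lead
+#   r2 = xexp*log2_tail + ln(f1)_tail + poly(u),  u ~= f2/f1
+#   log10f(x) ~= r1*log10e_lead
+#              + (r1*log10e_tail + r2*log10e_tail + r2*log10e_lead)
+#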
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+
+.equ stack_size,0x088
+
+
+
+.globl __vrs4_log10f
+ .type __vrs4_log10f,@function
+__vrs4_log10f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f1 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f1 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ movaps %xmm0,%xmm2
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ movaps %xmm1,%xmm3
+
+# logef to log10f
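+# (r1,r2) hold ln(x) as a lead/tail pair; multiply by log10(e), itself split
+# into lead/tail constants, and accumulate the small cross products first so
+# the final addition of r1*log10e_lead loses as little precision as possible.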
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_tail(%rip),%xmm0
+ mulps .L__real_log10e_lead(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm2
+ addps %xmm1,%xmm0
+ addps %xmm3,%xmm0
+ addps %xmm2,%xmm0
+# addps %xmm1,%xmm0
+
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# is 0 < x?  false for zero, negatives, and NaNs
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
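+# (Added note: this evaluates ln(1+r) = 2*atanh(r/(2+r)).  With u = 2r/(2+r),
+#  r - u equals r*(r/(2+r)) = correction, so the final "r + r2" below
+#  reproduces u + u*v*(ca_1 + v*(ca_2 + ...)).)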
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__finish
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two:				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+.L__real_log10e_lead: .quad 0x03EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x03EDE00003EDE0000
+.L__real_log10e_tail: .quad 0x03A37B1523A37B152 # log10e_tail 0.0007007319
+ .quad 0x03A37B1523A37B152
+
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs4log2f.S b/src/gas/vrs4log2f.S
new file mode 100644
index 0000000..05185b2
--- /dev/null
+++ b/src/gas/vrs4log2f.S
@@ -0,0 +1,639 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4log2f.s
+#
+# A vector implementation of the log2f libm function.
+# This routine is implemented in single precision.  It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_log2f(__m128 x);
+#
+# Computes the base-2 logarithm of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+#
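+# Informal outline of the computation (scalar view; added comment, the
+# symbols follow the labels defined below):
+#
+#   x = 2^xexp * f;  z1/z2 hold ln(f) as a lead/tail pair (same tables and
+#   reduction as the other log variants);
+#   log2f(x) ~= (z1*log2e_lead + xexp)                               /* r1 */
+#             + (z1*log2e_tail + z2*log2e_tail + z2*log2e_lead)      /* r2 */
+#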
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+
+.equ stack_size,0x088
+
+
+
+.globl __vrs4_log2f
+ .type __vrs4_log2f,@function
+__vrs4_log2f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check 2 as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+ movaps .L__real_log2e_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm3
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f1 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f1 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+ movaps %xmm1,%xmm4 #z2 copy
+ movaps p_z1(%rsp),%xmm0 # z1 values
+ movaps %xmm0,%xmm5 #z1 copy
+ mulps %xmm2,%xmm5 #z1*log2e_lead
+ mulps %xmm2,%xmm1 #z2*log2e_lead
+ mulps %xmm3,%xmm4 #z2*log2e_tail
+ mulps %xmm3,%xmm0 #z1*log2e_tail
+ addps %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addps %xmm4,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm1,%xmm0 #r2
+#return r1+r2
+ addps %xmm5,%xmm0 # r1+ r2
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# is 0 < x?  false for zero, negatives, and NaNs
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm1
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__finish
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two:				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+.L__real_log2e_lead: .quad 0x03FB800003FB80000 #1.4375000000
+ .quad 0x03FB800003FB80000
+.L__real_log2e_tail: .quad 0x03BAA3B293BAA3B29 # 0.0051950408889633
+ .quad 0x03BAA3B293BAA3B29
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs4logf.S b/src/gas/vrs4logf.S
new file mode 100644
index 0000000..4a39f1c
--- /dev/null
+++ b/src/gas/vrs4logf.S
@@ -0,0 +1,614 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4logf.s
+#
+# A vector implementation of the logf libm function.
+#   This routine is implemented in single precision.  It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128 __vrs4_logf(__m128 x);
+#
+# Computes the natural log of x.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+#
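+# A minimal usage sketch (wrapper name and unaligned load/store are
+# illustrative), assuming the routine is linked in and called with the
+# prototype above:
+#
+#   #include <xmmintrin.h>
+#   extern __m128 __vrs4_logf(__m128 x);
+#
+#   /* natural log of four packed floats */
+#   void log4(const float in[4], float out[4]) {
+#       _mm_storeu_ps(out, __vrs4_logf(_mm_loadu_ps(in)));
+#   }
+#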
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ	p_z1,0x020		# xmmword, z1 (table lead) values
+.equ	p_q,0x030		# xmmword, q (table tail) values
+.equ	p_corr,0x040		# xmmword, correction term
+.equ	p_omask,0x050		# xmmword, near-one mask
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+
+.equ stack_size,0x088
+
+
+
+.globl __vrs4_logf
+ .type __vrs4_logf,@function
+__vrs4_logf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+#
+# compute the index into the log tables
+#
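+# Rough per-lane sketch of the reduction used below: writing x = 2^n * m with
+# 1 <= m < 2 and m1 the nearest of the 64 tabulated points,
+#
+#   u     = 2*(m - m1)/(m + m1)
+#   ln(x) = n*ln(2) + ln(m1) + ln(m/m1)
+#        ~= n*ln(2) + ln(m1) + (u + u^3/12 + u^5/80 + u^7/448)
+#
+# where n*ln(2) and ln(m1) are each split into lead/tail parts (the tables at
+# the end of this file) so the final sum keeps more than single precision.
+#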
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ lea .L__np_ln_tail_table(%rip),%rdx
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov %cx,%r8w
+ shr $16,%rcx
+	mov		-256(%rdx,%r8,4),%eax		# get the q value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+	or		-256(%rdx,%r8,4),%ebx		# get the q value
+ shl $32,%rbx
+ or %rbx,%rax
+	mov		 %rax,p_q(%rsp)			# save the q values
+
+ mov %cx,%r8w
+ shr $16,%rcx
+	mov		-256(%rdx,%r8,4),%eax		# get the q value
+
+ mov %cx,%r8w
+	mov		-256(%rdx,%r8,4),%ebx		# get the q value
+ shl $32,%rbx
+ or %rbx,%rax
+	mov		 %rax,p_q+8(%rsp)		# save the q values
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ addps %xmm1,%xmm0
+
+# check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 < x ?  false for NaNs, so they are caught here too
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__finish
+
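+# Consolidated per-lane sketch of the near-one path above (ca_1..ca_4 are the
+# .L__real_ca* coefficients in the data section):
+#
+#   float r = x - 1.0f;
+#   float u = r / (2.0f + r);
+#   float correction = r * u;
+#   u = u + u;
+#   float v = u * u;
+#   float r2 = u*v*(ca_1 + v*(ca_2 + v*(ca_3 + v*ca_4))) - correction;
+#   return r + r2;
+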
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+ .data
+ .align 64
+
+.L__real_zero:		.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two:		.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:		.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs4powf.S b/src/gas/vrs4powf.S
new file mode 100644
index 0000000..42b005d
--- /dev/null
+++ b/src/gas/vrs4powf.S
@@ -0,0 +1,623 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4powf.s
+#
+# A vector implementation of the powf libm function.
+#
+# Prototype:
+#
+# __m128 __vrs4_powf(__m128 x,__m128 y);
+#
+# Computes x raised to the y power. Returns proper C99 values.
+# Uses new tuned fastlog/fastexp.
+#
+#
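+# Note on accuracy: the tuned double precision log/exp pair keeps the
+# intermediate y*log(x) well beyond single precision, which is what allows a
+# near-1ulp single precision result; see the sketch of the main path further
+# down in this file.
+#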
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_temp,0x00 # xmmword
+.equ p_negateres,0x10 # qword
+
+.equ p_xexp,0x20 # qword
+
+.equ p_ux,0x030 # storage for X
+.equ p_uy,0x040 # storage for Y
+
+.equ p_ax,0x050 # absolute x
+.equ p_sx,0x060 # sign of x's
+
+.equ p_ay,0x070 # absolute y
+.equ p_yexp,0x080 # unbiased exponent of y
+
+.equ p_inty,0x090 # integer y indicators
+.equ save_rbx,0x0A0 #
+
+.equ stack_size,0x0B8 # allocate 40h more than
+ # we need to avoid bank conflicts
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrs4_powf
+ .type __vrs4_powf,@function
+__vrs4_powf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+ movaps %xmm0,p_ux(%rsp) # save x
+ movaps %xmm1,p_uy(%rsp) # save y
+
+ movaps %xmm0,%xmm2
+ andps .L__mask_nsign(%rip),%xmm0 # get abs x
+ andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits
+ movaps %xmm0,p_ax(%rsp) # save them
+ movaps %xmm2,p_sx(%rsp) # save them
+# convert all four x's to double
+ cvtps2pd p_ax(%rsp),%xmm0
+ cvtps2pd p_ax+8(%rsp),%xmm1
+#
+# classify y
+# vector 32 bit integer method 25 cycles to here
+# /* See whether y is an integer.
+# inty = 0 means not an integer.
+# inty = 1 means odd integer.
+# inty = 2 means even integer.
+# */
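+#
+# One way to express this classification per lane in C (uy = bits of y,
+# ay = uy & 0x7fffffff; the packed-integer code below reaches the same result):
+#
+#   int yexp = (int)(ay >> 23) - 126;            /* unbiased exponent + 1    */
+#   int inty;
+#   if (yexp < 1)        inty = 0;               /* |y| < 1: not an integer  */
+#   else if (yexp > 24)  inty = 2;               /* no fractional bits left  */
+#   else {
+#       unsigned mask = (1u << (24 - yexp)) - 1u;
+#       if (uy & mask)                    inty = 0;   /* fractional bits set */
+#       else if ((uy >> (24 - yexp)) & 1) inty = 1;   /* lowest integer bit  */
+#       else                              inty = 2;
+#   }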
+ movdqa p_uy(%rsp),%xmm4
+ pxor %xmm3,%xmm3
+ pand .L__mask_nsign(%rip),%xmm4 # get abs y in integer format
+ movdqa %xmm4,p_ay(%rsp) # save it
+
+# see if the number is less than 1.0
+ psrld $23,%xmm4 #>> EXPSHIFTBITS_SP32
+
+ psubd .L__mask_127(%rip),%xmm4 # yexp, unbiased exponent
+ movdqa %xmm4,p_yexp(%rsp) # save it
+ paddd .L__mask_1(%rip),%xmm4 # yexp+1
+ pcmpgtd %xmm3,%xmm4 # 0 if exp less than 126 (2^0) (y < 1.0), else FFs
+# xmm4 is ffs if abs(y) >=1.0, else 0
+
+# see if the mantissa has fractional bits
+#build mask for mantissa
+ movdqa .L__mask_23(%rip),%xmm2
+ psubd p_yexp(%rsp),%xmm2 # 24-yexp
+ pmaxsw %xmm3,%xmm2 # no shift counts less than 0
+ movdqa %xmm2,p_temp(%rsp) # save the shift counts
+# create mask for all four values
+# SSE can't do individual shifts, so we have to do each one separately
+ mov p_temp(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rax,%rbx
+ mov %rbx,p_temp(%rsp)
+ mov p_temp+8(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rbx,%rax
+ mov %rax,p_temp+8(%rsp)
+ movdqa p_temp(%rsp),%xmm5
+ psubd .L__mask_1(%rip),%xmm5 #= mask = (1 << (24 - yexp)) - 1
+
+# now use the mask to see if there are any fractional bits
+ movdqa p_uy(%rsp),%xmm2 # get uy
+ pand %xmm5,%xmm2 # uy & mask
+ pcmpeqd %xmm3,%xmm2 # 0 if not zero (y has fractional mantissa bits), else FFs
+ pand %xmm4,%xmm2 # either 0s or ff
+# xmm2 now accounts for y< 1.0 or y>=1.0 and y has fractional mantissa bits,
+# it has the value 0 if we know it's non-integer or ff if integer.
+
+# now see if it's even or odd.
+
+## if yexp > 24, then it has to be even
+ movdqa .L__mask_24(%rip),%xmm4
+ psubd p_yexp(%rsp),%xmm4 # 24-yexp
+ paddd .L__mask_1(%rip),%xmm5 # mask+1 = least significant integer bit
+ pcmpgtd %xmm3,%xmm4 ## if 0, then must be even, else ff's
+
+ pand %xmm4,%xmm5 # set the integer bit mask to zero if yexp>24
+ paddd .L__mask_2(%rip),%xmm4
+ por .L__mask_2(%rip),%xmm4
+ pand %xmm2,%xmm4 # result can be 0, 2, or 3
+
+# now for integer numbers, see if odd or even
+ pand .L__mask_mant(%rip),%xmm5 # mask out exponent bits
+ movdqa .L__float_one(%rip),%xmm2
+ pand p_uy(%rsp),%xmm5 # & uy -> even or odd
+ pcmpeqd p_ay(%rsp),%xmm2 # is ay equal to 1, ff's if so, then it's odd
+ pand .L__mask_nsign(%rip),%xmm2 # strip the sign bit so the gt comparison works.
+ por %xmm2,%xmm5
+ pcmpgtd %xmm3,%xmm5 ## if odd then ff's, else 0's for even
+ paddd .L__mask_2(%rip),%xmm5 # gives us 2 for even, 1 for odd
+ pand %xmm5,%xmm4
+
+ movdqa %xmm4,p_inty(%rsp) # save inty
+#
+# do more x special case checking
+#
+ movdqa %xmm4,%xmm5
+ pcmpeqd %xmm3,%xmm5 # is not an integer? ff's if so
+ pand .L__mask_NaN(%rip),%xmm5 # these values will be NaNs, if x<0
+ movdqa %xmm4,%xmm2
+ pcmpeqd .L__mask_1(%rip),%xmm2 # is it odd? ff's if so
+ pand .L__mask_sign(%rip),%xmm2 # these values will get their sign bit set
+ por %xmm2,%xmm5
+
+ pcmpeqd p_sx(%rsp),%xmm3 ## if the signs are set
+ pandn %xmm5,%xmm3 # then negateres gets the values as shown below
+ movdqa %xmm3,p_negateres(%rsp) # save negateres
+
+# /* p_negateres now means the following.
+# ** 7FC00000 means x<0, y not an integer, return NaN.
+# ** 80000000 means x<0, y is odd integer, so set the sign bit.
+# ** 0 means even integer, and/or x>=0.
+# */
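+#
+# Per-lane sketch of how this encoding is applied (it is OR'ed into the
+# computed result later with orps; the names below are illustrative):
+#
+#   unsigned negateres =
+#         (x < 0.0f && inty == 0) ? 0x7FC00000u   /* quiet NaN           */
+#       : (x < 0.0f && inty == 1) ? 0x80000000u   /* set the sign bit    */
+#       :                           0x00000000u;  /* leave result alone  */
+#   result_bits |= negateres;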
+
+
+# **** Here starts the main calculations ****
+# The algorithm used is x**y = exp(y*log(x))
+# Extra precision is required in intermediate steps to meet the 1ulp requirement
+#
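+# Per-lane sketch of the main path (double precision intermediates carry the
+# extra bits; __vrd4_log and __vrd4_exp are the vector double routines called
+# below):
+#
+#   double w = y * log((double)fabsf(x));   /* y * log(|x|)                 */
+#   float  r = (float)exp(w);               /* |x|**y                       */
+#   /* zeros, infinities, NaNs and negative x are patched in afterwards     */
+#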
+# log(x) calculation
+ call __vrd4_log@PLT # get the double precision log value
+ # for all four x's
+# y* logx
+# convert all four y's to double
+ lea p_uy(%rsp),%rdx # get pointer to y
+ cvtps2pd (%rdx),%xmm2
+ cvtps2pd 8(%rdx),%xmm3
+
+# /* just multiply by y */
+ mulpd %xmm2,%xmm0
+ mulpd %xmm3,%xmm1
+
+# /* The following code computes r = exp(w) */
+ call __vrd4_exp@PLT # get the double exp value
+ # for all four y*log(x)'s
+#
+# convert all four results back to single precision
+ cvtpd2ps %xmm0,%xmm0
+ cvtpd2ps %xmm1,%xmm1
+ movlhps %xmm1,%xmm0
+
+# perform special case and error checking on input values
+
+# special case checking is done first in the scalar version since
+# it allows for early fast returns. But for vectors, we consider them
+# to be rare, so early returns are not necessary. So we first compute
+# the x**y values, and then check for special cases.
+
+# we do some of the checking in reverse order of the scalar version.
+ lea p_uy(%rsp),%rdx # get pointer to y
+# apply the negate result flags
+ orps p_negateres(%rsp),%xmm0 # get negateres
+
+## if y is infinite or so large that the result would overflow or underflow
+ movdqa p_ay(%rsp),%xmm4
+ cmpps $5,.L__mask_ly(%rip),%xmm4 # y not less than large value, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Ly_large
+.Lrnsx3:
+
+## if x is infinite
+ movdqa p_ax(%rsp),%xmm4
+ cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_infinite
+.Lrnsx1:
+## if x is zero
+ xorps %xmm4,%xmm4
+ cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_zero
+.Lrnsx2:
+## if y is NAN
+ lea p_uy(%rsp),%rdx # get pointer to y
+ movdqa (%rdx),%xmm4 # get y
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of y to itself should
+ # be false, unless y is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Ly_NaN
+.Lrnsx4:
+## if x is NAN
+ lea p_ux(%rsp),%rdx # get pointer to x
+ movdqa (%rdx),%xmm4 # get x
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should
+ # be false, unless x is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Lx_NaN
+.Lrnsx5:
+
+## if |y| == 0 then return 1
+ movdqa .L__float_one(%rip),%xmm3 # one
+ xorps %xmm2,%xmm2
+ cmpps $4,p_ay(%rsp),%xmm2 # not equal to 0.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+## if x == +1, return +1 for all x
+ lea p_ux(%rsp),%rdx # get pointer to x
+ movdqa %xmm3,%xmm2
+ cmpps $4,(%rdx),%xmm2 # not equal to +1.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+
+.L__powf_cleanup2:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+# y is a NaN.
+.Ly_NaN:
+ lea p_uy(%rsp),%rdx # get pointer to y
+ movdqa (%rdx),%xmm4 # get y
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of y to itself should
+ # be true, unless y is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+ lea p_ux(%rsp),%rcx # get pointer to x
+ movdqa (%rcx),%xmm4 # get x
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should
+ # be true, unless x is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx5
+
+# * y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lylrga
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lylrga:
+ test $2,%edx
+ jz .Lylrgb
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lylrgb:
+ test $4,%edx
+ jz .Lylrgc
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lylrgc:
+ test $8,%edx
+ jz .Lylrgd
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lylrgd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in %eax, y in %ebx.
+# returns result in eax
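+#
+# Per-lane sketch of this handler (helper name is hypothetical; |y| is huge
+# or infinite, so only the magnitude of x matters):
+#
+#   float pow_y_huge(float x, float y) {
+#       if (fabsf(x) == 1.0f) return 1.0f;
+#       if (y > 0.0f) return fabsf(x) > 1.0f ? INFINITY : 0.0f;
+#       else          return fabsf(x) < 1.0f ? INFINITY : 0.0f;
+#   }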
+.Lnp_special6:
+# handle |x|==1 cases first
+ mov $0x07FFFFFFF,%r8d
+ and %eax,%r8d
+ cmp $0x03f800000,%r8d # jump if |x| !=1
+ jnz .Lnps6
+ mov $0x03f800000,%eax # return 1 for all |x|==1
+ jmp .Lnpx64
+
+# cases where |x| !=1
+.Lnps6:
+ mov $0x07f800000,%ecx
+ xor %eax,%eax # assume 0 return
+ test $0x080000000,%ebx
+ jnz .Lnps62 # jump if y negative
+# y = +inf
+ cmp $0x03f800000,%r8d
+	cmovg	%ecx,%eax			# return inf if |x| > 1
+ jmp .Lnpx64
+.Lnps62:
+# y = -inf
+ cmp $0x03f800000,%r8d
+ cmovl %ecx,%eax # return inf if |x| < 1
+ jmp .Lnpx64
+
+.Lnpx64:
+ ret
+
+# handle cases where x is +/- infinity. edx is the mask
+ .align 16
+.Lx_infinite:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxinfa
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxinfa:
+ test $2,%edx
+ jz .Lxinfb
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxinfb:
+ test $4,%edx
+ jz .Lxinfc
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxinfc:
+ test $8,%edx
+ jz .Lxinfd
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxinfd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
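+#
+# Per-lane sketch of the C99 cases handled below:
+#
+#   if (x == INFINITY)         return y > 0.0f ? INFINITY : 0.0f;
+#   else /* x == -INFINITY */ {
+#       if (inty == 1)         return y > 0.0f ? -INFINITY : -0.0f;  /* odd y */
+#       else                   return y > 0.0f ?  INFINITY :  0.0f;
+#   }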
+.Lnp_special_x1: # x is infinite
+ test $0x080000000,%eax # is x positive
+ jnz .Lnsx11 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ xor %eax,%eax # else return 0
+ jmp .Lnsx13
+
+.Lnsx11:
+ cmp $1,%ecx ## if inty ==1
+ jnz .Lnsx12 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ mov $0x080000000,%eax # else return -0
+ jmp .Lnsx13
+.Lnsx12: # inty <>1
+ and $0x07FFFFFFF,%eax # return -x (|x|) if y<0
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 #
+ xor %eax,%eax # return 0 if y >=0
+.Lnsx13:
+ ret
+
+
+# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0
+ .align 16
+.Lx_zero:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxzera
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxzera:
+ test $2,%edx
+ jz .Lxzerb
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxzerb:
+ test $4,%edx
+ jz .Lxzerc
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxzerc:
+ test $8,%edx
+ jz .Lxzerd
+ lea p_ux(%rsp),%rcx # get pointer to x
+ lea p_uy(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxzerd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
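+#
+# Per-lane sketch of the C99 cases handled below (x is +/-0):
+#
+#   if (inty == 1)   /* odd integer y: the result keeps x's sign */
+#       return copysignf(y < 0.0f ? INFINITY : 0.0f, x);
+#   else             /* even or non-integer y */
+#       return y < 0.0f ? INFINITY : 0.0f;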
+ .align 16
+.Lnp_special_x2:
+ cmp $1,%ecx ## if inty ==1
+ jz .Lnsx21 # jump if so
+# handle cases of x=+/-0, y not integer
+ xor %eax,%eax
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%eax
+ jmp .Lnsx23
+# y is an integer
+.Lnsx21:
+ xor %r8d,%r8d
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%r8d # set to infinity if not
+ and $0x080000000,%eax # pickup the sign of x
+ or %r8d,%eax # and include it in the result
+.Lnsx23:
+ ret
+
+
+ .data
+ .align 64
+
+.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask
+ .quad 0x08000000080000000
+
+.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask
+ .quad 0x07FFFFFFF7FFFFFFF
+
+# used by inty
+.L__mask_127: .quad 0x00000007F0000007F # EXPBIAS_SP32
+ .quad 0x00000007F0000007F
+
+.L__mask_mant: .quad 0x0007FFFFF007FFFFF # mantissa bit mask
+ .quad 0x0007FFFFF007FFFFF
+
+.L__mask_1: .quad 0x00000000100000001 # 1
+ .quad 0x00000000100000001
+
+.L__mask_2: .quad 0x00000000200000002 # 2
+ .quad 0x00000000200000002
+
+.L__mask_24: .quad 0x00000001800000018 # 24
+ .quad 0x00000001800000018
+
+.L__mask_23: .quad 0x00000001700000017 # 23
+ .quad 0x00000001700000017
+
+# used by special case checking
+
+.L__float_one: .quad 0x03f8000003f800000 # one
+ .quad 0x03f8000003f800000
+
+.L__mask_inf:		.quad 0x07f8000007F800000	# infinity
+ .quad 0x07f8000007F800000
+
+.L__mask_NaN: .quad 0x07fC000007FC00000 # NaN
+ .quad 0x07fC000007FC00000
+
+.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit
+ .quad 0x00040000000400000
+
+.L__mask_ly: .quad 0x04f0000004f000000 # large y
+ .quad 0x04f0000004f000000
diff --git a/src/gas/vrs4powxf.S b/src/gas/vrs4powxf.S
new file mode 100644
index 0000000..e18b5db
--- /dev/null
+++ b/src/gas/vrs4powxf.S
@@ -0,0 +1,538 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4powxf.asm
+#
+# A vector implementation of the powf libm function.
+# This routine raises the x vector to a constant y power.
+#
+# Prototype:
+#
+# __m128 __vrs4_powxf(__m128 x,float y);
+#
+# Computes x raised to the y power. Returns proper C99 values.
+#
+#
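+# A minimal usage sketch (wrapper name is illustrative), assuming the routine
+# is linked in and called with the prototype above:
+#
+#   #include <xmmintrin.h>
+#   extern __m128 __vrs4_powxf(__m128 x, float y);
+#
+#   /* raise four packed floats to the same constant power y */
+#   void pow4x(const float x[4], float y, float out[4]) {
+#       _mm_storeu_ps(out, __vrs4_powxf(_mm_loadu_ps(x), y));
+#   }
+#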
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+
+# define local variable storage offsets
+.equ p_temp,0x00 # xmmword
+.equ p_negateres,0x10 # qword
+
+.equ save_rbx,0x020 #qword
+.equ save_rsi,0x028 #qword
+
+.equ p_xptr,0x030 # ptr to x values
+.equ p_y,0x038 # y value
+
+.equ p_inty,0x040 # integer y indicators
+
+.equ	p_ux,0x050		# storage for x
+.equ p_ax,0x060 # absolute x
+.equ p_sx,0x070 # sign of x's
+
+.equ stack_size,0x088 #
+
+
+
+
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrs4_powxf
+ .type __vrs4_powxf,@function
+__vrs4_powxf:
+
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+ lea p_ux(%rsp),%rcx
+ mov %rcx,p_xptr(%rsp) # save pointer to x
+ movaps %xmm0,(%rcx)
+ movss %xmm1,p_y(%rsp) # save y
+
+ movdqa %xmm1,%xmm4
+
+ movaps %xmm0,%xmm2
+ andps .L__mask_nsign(%rip),%xmm0 # get abs x
+ andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits
+ movaps %xmm0,p_ax(%rsp) # save them
+ movaps %xmm2,p_sx(%rsp) # save them
+# convert all four x's to double
+ cvtps2pd p_ax(%rsp),%xmm0
+ cvtps2pd p_ax+8(%rsp),%xmm1
+#
+# classify y
+# vector 32 bit integer method 25 cycles to here
+# /* See whether y is an integer.
+# inty = 0 means not an integer.
+# */
+# get yexp
+ mov p_y(%rsp),%r8d # r8 is uy
+ mov $0x07fffffff,%r9d
+ and %r8d,%r9d # r9 is ay
+
+## if |y| == 0 then return 1
+ cmp $0,%r9d # is y a zero?
+ jz .Ly_zero
+
+ mov $0x07f800000,%eax # EXPBITS_SP32
+ and %r9d,%eax # y exp
+
+ xor %edi,%edi
+ shr $23,%eax #>> EXPSHIFTBITS_SP32
+ sub $126,%eax # - EXPBIAS_SP32 + 1 - eax is now the unbiased exponent
+ mov $1,%ebx
+ cmp %ebx,%eax ## if (yexp < 1)
+ cmovl %edi,%ebx
+ jl .Lsave_inty
+
+ mov $24,%ecx
+ cmp %ecx,%eax ## if (yexp >24)
+ jle .Linfy1
+ mov $2,%ebx
+ jmp .Lsave_inty
+.Linfy1: # else 1<=yexp<=24
+ sub %eax,%ecx # build mask for mantissa
+ shl %cl,%ebx
+ dec %ebx # rbx = mask = (1 << (24 - yexp)) - 1
+
+ mov %r8d,%eax
+ and %ebx,%eax ## if ((uy & mask) != 0)
+ cmovnz %edi,%ebx # inty = 0;
+ jnz .Lsave_inty
+
+ not %ebx # else if (((uy & ~mask) >> (24 - yexp)) & 0x00000001)
+ mov %r8d,%eax
+ and %ebx,%eax
+ shr %cl,%eax
+ inc %edi
+ and %edi,%eax
+ mov %edi,%ebx # inty = 1
+ jnz .Lsave_inty
+ inc %ebx # else inty = 2
+
+
+.Lsave_inty:
+	mov	%r8d,p_y+4(%rsp)	# duplicate y (r8d is uy)
+ mov %ebx,p_inty(%rsp) # save inty
+#
+# do more x special case checking
+#
+ pxor %xmm3,%xmm3
+ xor %eax,%eax
+ mov $0x07FC00000,%ecx
+ cmp $0,%ebx # is y not an integer?
+ cmovz %ecx,%eax # then set to return a NaN. else 0.
+ mov $0x080000000,%ecx
+ cmp $1,%ebx # is y an odd integer?
+ cmovz %ecx,%eax # maybe set sign bit if so
+ movd %eax,%xmm5
+ pshufd $0,%xmm5,%xmm5
+
+ pcmpeqd p_sx(%rsp),%xmm3 ## if the signs are set
+ pandn %xmm5,%xmm3 # then negateres gets the values as shown below
+ movdqa %xmm3,p_negateres(%rsp) # save negateres
+
+# /* p_negateres now means the following.
+# 7FC00000 means x<0, y not an integer, return NaN.
+# 80000000 means x<0, y is odd integer, so set the sign bit.
+## 0 means even integer, and/or x>=0.
+# */
+
+# **** Here starts the main calculations ****
+# The algorithm used is x**y = exp(y*log(x))
+# Extra precision is required in intermediate steps to meet the 1ulp requirement
+#
+# log(x) calculation
+ call __vrd4_log@PLT # get the double precision log value
+ # for all four x's
+# y* logx
+ cvtps2pd p_y(%rsp),%xmm2 #convert the two packed single y's to double
+
+# /* just multiply by y */
+ mulpd %xmm2,%xmm0
+ mulpd %xmm2,%xmm1
+
+# /* The following code computes r = exp(w) */
+ call __vrd4_exp@PLT # get the double exp value
+ # for all four y*log(x)'s
+#
+# convert all four results back to single precision
+ cvtpd2ps %xmm0,%xmm0
+ cvtpd2ps %xmm1,%xmm1
+ movlhps %xmm1,%xmm0
+
+# perform special case and error checking on input values
+
+# special case checking is done first in the scalar version since
+# it allows for early fast returns. But for vectors, we consider them
+# to be rare, so early returns are not necessary. So we first compute
+# the x**y values, and then check for special cases.
+
+# we do some of the checking in reverse order of the scalar version.
+# apply the negate result flags
+ orps p_negateres(%rsp),%xmm0 # get negateres
+
+## if y is infinite or so large that the result would overflow or underflow
+ mov p_y(%rsp),%edx # get y
+ and $0x07fffffff,%edx # develop ay
+ cmp $0x04f000000,%edx
+ ja .Ly_large
+.Lrnsx3:
+
+## if x is infinite
+ movdqa p_ax(%rsp),%xmm4
+ cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_infinite
+.Lrnsx1:
+## if x is zero
+ xorps %xmm4,%xmm4
+ cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_zero
+.Lrnsx2:
+## if y is NAN
+ movss p_y(%rsp),%xmm4 # get y
+ ucomiss %xmm4,%xmm4 # comparing y to itself should
+ # be true, unless y is a NaN. parity flag if NaN.
+ jp .Ly_NaN
+.Lrnsx4:
+## if x is NAN
+ movdqa p_ax(%rsp),%xmm4 # get x
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should
+ # be false, unless x is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Lx_NaN
+.Lrnsx5:
+
+## if x == +1, return +1 for all x
+ movdqa .L__float_one(%rip),%xmm3 # one
+ mov p_xptr(%rsp),%rdx # get pointer to x
+ movdqa %xmm3,%xmm2
+ cmpps $4,(%rdx),%xmm2 # not equal to +1.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+
+.L__powf_cleanup2:
+
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+.Ly_zero:
+## if |y| == 0 then return 1
+ movdqa .L__float_one(%rip),%xmm0 # one
+ jmp .L__powf_cleanup2
+# * y is a NaN.
+.Ly_NaN:
+ mov p_y(%rsp),%r8d
+ or $0x000400000,%r8d # convert to QNaNs
+ movd %r8d,%xmm0 # propagate to all results
+ shufps $0,%xmm0,%xmm0
+ jmp .Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ movdqa (%rcx),%xmm4 # get x
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should
+ # be true, unless x is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx5
+
+# * y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+ movdqa %xmm0,p_temp(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov (%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 4(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 8(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 12(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in %eax, y in %ebx.
+# returns result in eax
+.Lnp_special6:
+# handle |x|==1 cases first
+ mov $0x07FFFFFFF,%r8d
+ and %eax,%r8d
+ cmp $0x03f800000,%r8d # jump if |x| !=1
+ jnz .Lnps6
+ mov $0x03f800000,%eax # return 1 for all |x|==1
+ jmp .Lnpx64
+
+# cases where |x| !=1
+.Lnps6:
+ mov $0x07f800000,%ecx
+ xor %eax,%eax # assume 0 return
+ test $0x080000000,%ebx
+ jnz .Lnps62 # jump if y negative
+# y = +inf
+ cmp $0x03f800000,%r8d
+	cmovg	%ecx,%eax			# return inf if |x| > 1
+ jmp .Lnpx64
+.Lnps62:
+# y = -inf
+ cmp $0x03f800000,%r8d
+ cmovl %ecx,%eax # return inf if |x| < 1
+ jmp .Lnpx64
+
+.Lnpx64:
+ ret
+
+# handle cases where x is +/- infinity. edx is the mask
+ .align 16
+.Lx_infinite:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxinfa
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov (%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxinfa:
+ test $2,%edx
+ jz .Lxinfb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxinfb:
+ test $4,%edx
+ jz .Lxinfc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxinfc:
+ test $8,%edx
+ jz .Lxinfd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxinfd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+.Lnp_special_x1: # x is infinite
+ test $0x080000000,%eax # is x positive
+ jnz .Lnsx11 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ xor %eax,%eax # else return 0
+ jmp .Lnsx13
+
+.Lnsx11:
+ cmp $1,%ecx ## if inty ==1
+ jnz .Lnsx12 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ mov $0x080000000,%eax # else return -0
+ jmp .Lnsx13
+.Lnsx12: # inty <>1
+ and $0x07FFFFFFF,%eax # return |x| (= +inf) if y > 0
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # return +inf if so
+ xor %eax,%eax # return 0 if y < 0
+.Lnsx13:
+ ret
+
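+# A C-style reading of .Lnp_special_x1 above (sketch only; inty == 1 is taken
+# to mean "y is an odd integer"):
+#
+#   float np_special_x1(float x, float y, int inty)   /* x is +/-infinity */
+#   {
+#       if (x > 0.0f)  return (y > 0.0f) ? x : 0.0f;   /* x = +inf         */
+#       if (inty == 1) return (y > 0.0f) ? x : -0.0f;  /* x = -inf, odd y  */
+#       return (y > 0.0f) ? INFINITY : 0.0f;           /* x = -inf, else   */
+#   }
+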
+
+# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0
+ .align 16
+.Lx_zero:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxzera
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov (%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxzera:
+ test $2,%edx
+ jz .Lxzerb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxzerb:
+ test $4,%edx
+ jz .Lxzerc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxzerc:
+ test $8,%edx
+ jz .Lxzerd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxzerd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in %eax, y in %ebx, inty in %ecx.
+# returns result in eax
+ .align 16
+.Lnp_special_x2:
+ cmp $1,%ecx ## if inty ==1
+ jz .Lnsx21 # jump if so
+# handle cases of x=+/-0, y not integer
+ xor %eax,%eax
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%eax
+ jmp .Lnsx23
+# y is an integer
+.Lnsx21:
+ xor %r8d,%r8d
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%r8d # set to infinity if not
+ and $0x080000000,%eax # pickup the sign of x
+ or %r8d,%eax # and include it in the result
+.Lnsx23:
+ ret
+
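+# Equivalent scalar logic for .Lnp_special_x2 above, as a sketch (again,
+# inty == 1 is taken to mean "y is an odd integer"):
+#
+#   float np_special_x2(float x, float y, int inty)    /* x is +/-0 */
+#   {
+#       if (inty != 1) return (y > 0.0f) ? 0.0f : INFINITY;
+#       /* odd integer y: the result carries the sign of x */
+#       return copysignf((y > 0.0f) ? 0.0f : INFINITY, x);
+#   }
+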
+
+ .data
+ .align 64
+
+.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask
+ .quad 0x08000000080000000
+
+.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask
+ .quad 0x07FFFFFFF7FFFFFFF
+
+# used by special case checking
+
+.L__float_one: .quad 0x03f8000003f800000 # one
+ .quad 0x03f8000003f800000
+
+.L__mask_inf: .quad 0x07f8000007F800000 # infinity
+ .quad 0x07f8000007F800000
+
+.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit
+ .quad 0x00040000000400000
+
+
diff --git a/src/gas/vrs4sincosf.S b/src/gas/vrs4sincosf.S
new file mode 100644
index 0000000..2c3a0cc
--- /dev/null
+++ b/src/gas/vrs4sincosf.S
@@ -0,0 +1,1813 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4sincosf.asm
+#
+# A vector implementation of the sincos libm function.
+#
+# Prototype:
+#
+# __vrs4_sincosf(__m128 x, __m128 * ys, __m128 * yc);
+#
+# Computes Sine and Cosine of x for an array of input values.
+# Places the Sine results into the supplied ys array and the Cosine results into the supplied yc array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 single precision Sine Cosine values at a time.
+# The four values are passed as packed single in xmm0.
+# The four Sine results are returned as packed singles in the supplied ys array.
+# The four Cosine results are returned as packed singles in the supplied yc array.
+# Note that this represents a non-standard ABI usage, as no ABI (and indeed C)
+# currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
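+# A hypothetical C caller (sketch only; the return type is assumed to be void,
+# matching the prototype comment above):
+#
+#   #include <xmmintrin.h>
+#   extern void __vrs4_sincosf(__m128 x, __m128 *ys, __m128 *yc);
+#
+#   void sincos4(const float *in, float *s, float *c)
+#   {
+#       __m128 ys, yc;
+#       __vrs4_sincosf(_mm_loadu_ps(in), &ys, &yc);
+#       _mm_storeu_ps(s, ys);
+#       _mm_storeu_ps(c, yc);
+#   }
+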
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+.align 64
+ .Levensin_oddcos_tbl:
+
+ .quad .Lsinsin_sinsin_piby4 # 0 * ; Done
+ .quad .Lsinsin_sincos_piby4 # 1 + ; Done
+ .quad .Lsinsin_cossin_piby4 # 2 ; Done
+ .quad .Lsinsin_coscos_piby4 # 3 + ; Done
+
+ .quad .Lsincos_sinsin_piby4 # 4 ; Done
+ .quad .Lsincos_sincos_piby4 # 5 * ; Done
+ .quad .Lsincos_cossin_piby4 # 6 ; Done
+ .quad .Lsincos_coscos_piby4 # 7 ; Done
+
+ .quad .Lcossin_sinsin_piby4 # 8 ; Done
+ .quad .Lcossin_sincos_piby4 # 9 ; TBD
+ .quad .Lcossin_cossin_piby4 # 10 * ; Done
+ .quad .Lcossin_coscos_piby4 # 11 ; Done
+
+ .quad .Lcoscos_sinsin_piby4 # 12 ; Done
+ .quad .Lcoscos_sincos_piby4 # 13 + ; Done
+ .quad .Lcoscos_cossin_piby4 # 14 ; Done
+ .quad .Lcoscos_coscos_piby4 # 15 * ; Done
+
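+# The 4-bit index used with this table (built later from the per-element
+# regions) is, in effect:
+#
+#   index = (region0 & 1) | ((region1 & 1) << 1)
+#         | ((region2 & 1) << 2) | ((region3 & 1) << 3);
+#
+# A set bit means that element fell in an odd region (npi2 odd), where the
+# computed sine and cosine polynomial results must be exchanged; each table
+# entry performs exactly the swaps its bit pattern requires.
+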
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # temporary for get/put bits operation
+.equ save_xmm7,0x30 # temporary for get/put bits operation
+.equ save_xmm8,0x40 # temporary for get/put bits operation
+.equ save_xmm9,0x50 # temporary for get/put bits operation
+.equ save_xmm0,0x60 # temporary for get/put bits operation
+.equ save_xmm11,0x70 # temporary for get/put bits operation
+.equ save_xmm12,0x80 # temporary for get/put bits operation
+.equ save_xmm13,0x90 # temporary for get/put bits operation
+.equ save_xmm14,0x0A0 # temporary for get/put bits operation
+.equ save_xmm15,0x0B0 # temporary for get/put bits operation
+
+.equ r,0x0C0 # pointer to r for remainder_piby2
+.equ rr,0x0D0 # pointer to r for remainder_piby2
+.equ region,0x0E0 # pointer to r for remainder_piby2
+
+.equ r1,0x0F0 # pointer to r for remainder_piby2
+.equ rr1,0x0100 # pointer to r for remainder_piby2
+.equ region1,0x0110 # pointer to r for remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # mask
+.equ p_sign_sin,0x0180 # Sign of lower sin term
+
+.equ p_original1,0x0190 # original x
+.equ p_mask1,0x01A0 # mask
+.equ p_sign1_sin,0x01B0 # Sign of upper sin term
+
+
+.equ save_r12,0x01C0 # save area for r12
+.equ save_r13,0x01D0 # save area for r13
+
+.equ p_sin,0x01E0 # sin
+.equ p_cos,0x01F0 # cos
+
+.equ save_rdi,0x0200 # save area for rdi
+.equ save_rsi,0x0210 # save area for rsi
+
+.equ p_sign_cos,0x0220 # Sign of lower cos term
+.equ p_sign1_cos,0x0230 # Sign of upper cos term
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.globl __vrs4_sincosf
+ .type __vrs4_sincosf,@function
+__vrs4_sincosf:
+
+ sub $0x0248,%rsp
+
+ mov %r12,save_r12(%rsp) # save r12
+
+ mov %r13,save_r13(%rsp) # save r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+mov %rdi, p_sin(%rsp) # save address for sin return
+mov %rsi, p_cos(%rsp) # save address for cos return
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+#DELETE
+# mov .LQWORD,%rdx PTR __reald_one_zero ;compare value for cossin path
+#DELETE
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
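+# Per element, the sequence above is the usual Cody-Waite style reduction with
+# a three-part split of pi/2 (a scalar C sketch of the same steps):
+#
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;       /* piby2_1 = leading bits of pi/2 */
+#   rtail = npi2 * piby2_2;
+#   t     = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   /* later: r = rhead - rtail, with |r| <= pi/4 and region = npi2 */
+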
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+# NEW
+
+ #ADDED
+ mov %r10,%rdi # npi2 in int
+ mov %r11,%rsi # npi2 in int
+ #ADDED
+
+ shr $1,%r10 # 0 and 1 => 0
+ shr $1,%r11 # 2 and 3 => 1
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ #ADDED
+ xor %r10,%rdi # xor last 2 bits of region for cos
+ xor %r11,%rsi # xor last 2 bits of region for cos
+ #ADDED
+
+ not %r12 #~(sign)
+ not %r13 #~(sign)
+ and %r12,%r10 #region & ~(sign)
+ and %r13,%r11 #region & ~(sign)
+
+ not %rax #~(region)
+ not %rcx #~(region)
+ not %r12 #~~(sign)
+ not %r13 #~~(sign)
+ and %r12,%rax #~region & ~~(sign)
+ and %r13,%rcx #~region & ~~(sign)
+
+ #ADDED
+ and .L__reald_one_one(%rip),%rdi # sign for cos
+ and .L__reald_one_one(%rip),%rsi # sign for cos
+ #ADDED
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 # sign for sin
+ and .L__reald_one_one(%rip),%r11 # sign for sin
+
+
+
+
+
+
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ #ADDED
+ mov %rdi,%rax
+ mov %rsi,%rcx
+ #ADDED
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ #ADDED
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+ #ADDED
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ #ADDED
+ shl $63,%rdi #shift lower sign bit left by 63 bits
+ shl $63,%rsi #shift lower sign bit left by 63 bits
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+ #ADDED
+
+ mov %r10,p_sign_sin(%rsp) #write out lower sign bit
+ mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1_sin(%rsp) #write out lower sign bit
+ mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit
+
+ mov %rdi,p_sign_cos(%rsp) #write out lower sign bit
+ mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit
+ mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit
+ mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit
+
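+# The bit manipulation above reduces to the following per-element rule
+# (a sketch; "sign" is 1 when the original x was negative and "region" is npi2):
+#
+#   sign_sin = ((region >> 1) ^ sign)   & 1;   /* sin is an odd function  */
+#   sign_cos = ((region >> 1) ^ region) & 1;   /* cos is an even function */
+#
+# Each resulting 1 is expanded to a 0x8000000000000000 mask over the
+# corresponding packed double and xor'ed into the final sin/cos value
+# during the cleanup phase.
+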
+# NEW
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+# subpd %xmm10,%xmm6 ;rr=rhead-r
+# subpd %xmm1,%xmm7 ;rr=rhead-r
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+# subpd xmm6, xmm8 ;rr=(rhead-r) -rtail
+# subpd xmm7, xmm9 ;rr=(rhead-r) -rtail
+
+ and .L__reald_zero_one(%rip),%rax # region for jump table
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+# HARSHA ADDED
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_sin = Sign, p_sign_cos = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_sin = Sign, p_sign1_cos = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm14 # for x3
+ movapd %xmm3,%xmm15 # for x3
+
+ movapd %xmm2,%xmm0 # for r
+ movapd %xmm3,%xmm11 # for r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm6 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm7 # c4
+
+ movapd .Lsinarray+0x10(%rip),%xmm12 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm13 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm14 # x3
+ mulpd %xmm1,%xmm15 # x3
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm6 # c2*x2
+ mulpd %xmm3,%xmm7 # c2*x2
+
+ mulpd %xmm2,%xmm12 # c4*x2
+ mulpd %xmm3,%xmm13 # c4*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4
+
+ addpd .Lsinarray(%rip),%xmm12 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm13 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ mulpd %xmm2,%xmm6 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm7 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ addpd %xmm12,%xmm6 # zs
+ addpd %xmm13,%xmm7 # zs
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ mulpd %xmm14,%xmm6 # x3 * zs
+ mulpd %xmm15,%xmm7 # x3 * zs
+
+ subpd %xmm0,%xmm4 # - (-t)
+ subpd %xmm11,%xmm5 # - (-t)
+
+ addpd %xmm10,%xmm6 # +x
+ addpd %xmm1,%xmm7 # +x
+
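+# What the interleaved arithmetic above computes, per double element (sketch):
+#
+#   zc  = (c1 + x2*c2) + x4*(c3 + x2*c4);
+#   zs  = (s1 + x2*s2) + x4*(s3 + x2*s4);
+#   cos = 1.0 - 0.5*x2 + x4*zc;        /* ends up in xmm4 / xmm5 */
+#   sin = r + x3*zs;                   /* ends up in xmm6 / xmm7 */
+#
+# where r is the reduced argument, x2 = r*r, x3 = x2*r and x4 = x2*x2.
+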
+# HARSHA ADDED
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm8,,%xmm5 xmm0, xmm12
+# %xmm11,,%xmm9 xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm0 # xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ subsd %xmm10,%xmm6 # rr=rhead-r
+ subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
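+# The out-of-line helper below is called with the SysV integer argument
+# registers; from the register setup its signature appears to be roughly
+# (an inference from this code, not a documented interface):
+#
+#   void __remainder_piby2d2f(uint64_t x_bits, double *r, int *region);
+#
+# i.e. the raw 64-bit pattern of the double argument in %rdi, and pointers to
+# the reduced argument and the region in %rsi and %rdx respectively.
+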
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Restore xmm4 and %xmm3,,%xmm1 xmm7
+# Can use %xmm0,,%xmm8 xmm12
+# %xmm9,,%xmm5 xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%eax # eax = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm2,%xmm0 # xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %eax,region(%rsp) # store lower region
+
+# movsd %xmm6,%xmm10
+# subsd xmm10,xmm0 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm10,%xmm6 ; rr=rhead-r
+# subsd xmm6, xmm0 ; xmm6 = rr=((rhead-r) -rtail)
+
+ subsd %xmm0,%xmm6 # xmm6 = r = (rhead-rtail)
+
+# movlpd QWORD PTR r[rsp], xmm10 ; store upper r
+# movlpd QWORD PTR rr[rsp], xmm6 ; store upper rr
+
+ movlpd %xmm6,r(%rsp) # store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm1,%xmm7 ; rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+# subpd %xmm1,%xmm7 ; rr=rhead-r
+# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr1[rsp], xmm7
+
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Can use %xmm11,,%xmm9 xmm13
+# %xmm8,,%xmm5 xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower half of the fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+
+ subsd %xmm10,%xmm7 # xmm7 = r = (rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call
+
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+ jmp .L__vrs4_sincosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rcx,,%rax r8, r9
+#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+# movsd %xmm7,%xmm1
+# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm1,%xmm7 ; rr=rhead-r
+# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail)
+
+ subsd %xmm10,%xmm7 # xmm7 = r = (rhead-rtail)
+
+# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r
+# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr
+
+ movlpd %xmm7,r1(%rsp) # store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sincosf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_sin = Sign, ; p_sign_cos = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_sin = Sign, ; p_sign1_cos = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+
+# NEW
+
+ #ADDED
+ mov %r10,%rdi
+ mov %r11,%rsi
+ #ADDED
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ #ADDED
+ xor %r10,%rdi
+ xor %r11,%rsi
+ #ADDED
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ #ADDED
+ and .L__reald_one_one(%rip),%rdi #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%rsi #(~AB+A~B)&1
+ #ADDED
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+
+
+
+
+
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ #ADDED
+ mov %rdi,%rax
+ mov %rsi,%rcx
+ #ADDED
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ #ADDED
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+ #ADDED
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ #ADDED
+ shl $63,%rdi #shift lower sign bit left by 63 bits
+ shl $63,%rsi #shift lower sign bit left by 63 bits
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+ #ADDED
+
+ mov %r10,p_sign_sin(%rsp) #write out lower sign bit
+ mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1_sin(%rsp) #write out lower sign bit
+ mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit
+
+ mov %rdi,p_sign_cos(%rsp) #write out lower sign bit
+ mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit
+ mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit
+ mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit
+#NEW
+
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+# HARSHA ADDED
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_cos = Sign, p_sign_sin = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_cos = Sign, p_sign1_sin = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm14 # for x3
+ movapd %xmm3,%xmm15 # for x3
+
+ movapd %xmm2,%xmm0 # for r
+ movapd %xmm3,%xmm11 # for r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm6 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm7 # c4
+
+ movapd .Lsinarray+0x10(%rip),%xmm12 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm13 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm14 # x3
+ mulpd %xmm1,%xmm15 # x3
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm6 # c2*x2
+ mulpd %xmm3,%xmm7 # c2*x2
+
+ mulpd %xmm2,%xmm12 # c4*x2
+ mulpd %xmm3,%xmm13 # c4*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4
+
+ addpd .Lsinarray(%rip),%xmm12 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm13 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ mulpd %xmm2,%xmm6 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm7 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ addpd %xmm12,%xmm6 # zs
+ addpd %xmm13,%xmm7 # zs
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ mulpd %xmm14,%xmm6 # x3 * zs
+ mulpd %xmm15,%xmm7 # x3 * zs
+
+ subpd %xmm0,%xmm4 # - (-t)
+ subpd %xmm11,%xmm5 # - (-t)
+
+ addpd %xmm10,%xmm6 # +x
+ addpd %xmm1,%xmm7 # +x
+
+# HARSHA ADDED
+
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sincosf_cleanup:
+
+ mov p_sin(%rsp),%rdi
+ mov p_cos(%rsp),%rsi
+
+ movapd p_sign_cos(%rsp),%xmm10
+ movapd p_sign1_cos(%rsp),%xmm1
+
+
+ xorpd %xmm4,%xmm10 # Cos term (+) Sign
+ xorpd %xmm5,%xmm1 # Cos term (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+
+ movapd p_sign_sin(%rsp),%xmm14
+ movapd p_sign1_sin(%rsp),%xmm15
+
+ xorpd %xmm6,%xmm14 # Sin term (+) Sign
+ xorpd %xmm7,%xmm15 # Sin term (+) Sign
+
+ cvtpd2ps %xmm14,%xmm12
+ cvtpd2ps %xmm15,%xmm13
+
+ movlps %xmm0,(%rsi) # save the cos
+ movlps %xmm12,(%rdi) # save the sin
+ movlps %xmm11,8(%rsi) # save the cos
+ movlps %xmm13,8(%rdi) # save the sin
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x0248,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.Lcoscos_coscos_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower and Upper Even
+
+ movapd %xmm4,%xmm8
+ movapd %xmm5,%xmm9
+
+ movapd %xmm6,%xmm4
+ movapd %xmm7,%xmm5
+
+ movapd %xmm8,%xmm6
+ movapd %xmm9,%xmm7
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcossin_cossin_piby4:
+
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsincos_cossin_piby4:
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcoscos_sinsin_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower even, Upper odd, Swap upper
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower odd, Upper even, Swap lower
+
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcoscos_cossin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ jmp .L__vrs4_sincosf_cleanup
+
+
+.align 16
+.Lsincos_coscos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm5
+ movsd %xmm9,%xmm7
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm4
+ movsd %xmm8,%xmm6
+ jmp .L__vrs4_sincosf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+# Lower and Upper odd, So Swap
+
+ jmp .L__vrs4_sincosf_cleanup
diff --git a/src/gas/vrs4sinf.S b/src/gas/vrs4sinf.S
new file mode 100644
index 0000000..3744f33
--- /dev/null
+++ b/src/gas/vrs4sinf.S
@@ -0,0 +1,2171 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs4sinf.s
+#
+# A vector implementation of the sin libm function.
+#
+# Prototype:
+#
+# __m128 __vrs4_sinf(__m128 x);
+#
+# Computes Sine of x for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 single precision Sine values at a time.
+# The four values are passed as packed single in xmm10.
+# The four results are returned as packed singles in xmm10.
+# Note that this represents a non-standard ABI usage, as no ABI (and indeed C)
+# currently allows returning two values from a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
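+# A hypothetical C caller (sketch only; this assumes the standard SysV ABI
+# matches the prototype above, with x passed and the result returned in xmm0):
+#
+#   #include <xmmintrin.h>
+#   extern __m128 __vrs4_sinf(__m128 x);
+#
+#   void sin4(const float *in, float *out)
+#   {
+#       _mm_storeu_ps(out, __vrs4_sinf(_mm_loadu_ps(in)));
+#   }
+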
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
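+
+# The tables above hold the sin and cos polynomial coefficients for the
+# reduced argument range [-pi/4, pi/4]; .Lsincosarray and .Lcossinarray
+# interleave them so one packed operation can serve a sin lane and a cos
+# lane at once. A minimal C sketch of the evaluation the kernels below
+# perform (coefficients rounded as in the comments; the exact values are
+# the hex constants above):
+#
+#   static double sin_piby4(double r)
+#   {
+#       const double s1 = -0.166667, s2 = 0.00833333,
+#                    s3 = -0.000198413, s4 = 2.75573e-6;
+#       double r2 = r * r, r4 = r2 * r2;
+#       double zs = (s1 + r2 * s2) + r4 * (s3 + r2 * s4);
+#       return r + (r2 * r) * zs;                /* x + x3*zs */
+#   }
+#
+#   static double cos_piby4(double r)
+#   {
+#       const double c1 = 0.0416667, c2 = -0.00138889,
+#                    c3 = 2.48016e-5, c4 = -2.75573e-7;
+#       double r2 = r * r, r4 = r2 * r2;
+#       double t  = 1.0 - 0.5 * r2;
+#       double zc = (c1 + r2 * c2) + r4 * (c3 + r2 * c4);
+#       return t + r4 * zc;                      /* t + x4*zc */
+#   }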
+
+.align 64
+ .Levensin_oddcos_tbl:
+
+ .quad .Lsinsin_sinsin_piby4 # 0 * ; Done
+ .quad .Lsinsin_sincos_piby4 # 1 + ; Done
+ .quad .Lsinsin_cossin_piby4 # 2 ; Done
+ .quad .Lsinsin_coscos_piby4 # 3 + ; Done
+
+ .quad .Lsincos_sinsin_piby4 # 4 ; Done
+ .quad .Lsincos_sincos_piby4 # 5 * ; Done
+ .quad .Lsincos_cossin_piby4 # 6 ; Done
+ .quad .Lsincos_coscos_piby4 # 7 ; Done
+
+ .quad .Lcossin_sinsin_piby4 # 8 ; Done
+ .quad .Lcossin_sincos_piby4 # 9 ; TBD
+ .quad .Lcossin_cossin_piby4 # 10 * ; Done
+ .quad .Lcossin_coscos_piby4 # 11 ; Done
+
+ .quad .Lcoscos_sinsin_piby4 # 12 ; Done
+ .quad .Lcoscos_sincos_piby4 # 13 + ; Done
+ .quad .Lcoscos_cossin_piby4 # 14 ; Done
+ .quad .Lcoscos_coscos_piby4 # 15 * ; Done
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .text
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20				# stack slot reserved to save xmm6
+.equ save_xmm7,0x30				# stack slot reserved to save xmm7
+.equ save_xmm8,0x40				# stack slot reserved to save xmm8
+.equ save_xmm9,0x50				# stack slot reserved to save xmm9
+.equ save_xmm0,0x60				# stack slot in the xmm save area
+.equ save_xmm11,0x70				# stack slot reserved to save xmm11
+.equ save_xmm12,0x80				# stack slot reserved to save xmm12
+.equ save_xmm13,0x90				# stack slot reserved to save xmm13
+.equ save_xmm14,0x0A0				# stack slot reserved to save xmm14
+.equ save_xmm15,0x0B0				# stack slot reserved to save xmm15
+
+.equ r,0x0C0					# storage for r (first pair) passed to remainder_piby2
+.equ rr,0x0D0					# storage for rr (first pair)
+.equ region,0x0E0				# storage for region (first pair)
+
+.equ r1,0x0F0					# storage for r (second pair) passed to remainder_piby2
+.equ rr1,0x0100					# storage for rr (second pair)
+.equ region1,0x0110				# storage for region (second pair)
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160				# original x (first pair)
+.equ p_mask,0x0170				# mask (first pair)
+.equ p_sign,0x0180				# sign bits (first pair)
+
+.equ p_original1,0x0190			# original x (second pair)
+.equ p_mask1,0x01A0				# mask (second pair)
+.equ p_sign1,0x01B0				# sign bits (second pair)
+
+.equ save_r12,0x01C0				# stack slot to save/restore r12
+.equ save_r13,0x01D0				# stack slot to save/restore r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.globl __vrs4_sinf
+ .type __vrs4_sinf,@function
+__vrs4_sinf:
+
+ sub $0x01E8,%rsp
+
+ mov %r12,save_r12(%rsp) # save r12
+
+ mov %r13,save_r13(%rsp) # save r13
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
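+
+# In C terms, the packed code above performs a Cody-Waite style reduction
+# with pi/2 split across three constants (piby2_1 + piby2_2 + piby2_2tail),
+# so that the leading products stay (nearly) exact for the modest npi2
+# values handled on this path. A sketch of the math only:
+#
+#   int    npi2  = (int)(x * twobypi + 0.5);    /* x >= 0 here, so this rounds */
+#   double rhead = x - npi2 * piby2_1;
+#   double rtail = npi2 * piby2_2;
+#   double t     = rhead;
+#   rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   double r     = rhead - rtail;               /* reduced argument */
+#   /* npi2 is kept as the "region"; only its low two bits are used later */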
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
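+
+# The GPR logic above builds, per double lane, the sign of the final result.
+# Since sin(-x) = -sin(x) and sin(x + pi) = -sin(x), the sign is the XOR of
+# the input sign (A) and bit 1 of the quadrant (B); the "~AB+A~B" comments
+# are exactly A ^ B. Conceptually, per lane (a sketch, not the exact code):
+#
+#   int      neg  = input_was_negative ^ ((npi2 >> 1) & 1);
+#   uint64_t mask = (uint64_t)neg << 63;   /* xor'ed onto the double result later */
+#
+# (the shl 63 / shl 31 pair just places that bit at the sign position of the
+# low and high double of each packed pair).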
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
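+
+# The jump-table index packs the quadrant parity of the four lanes into one
+# 4-bit value: an even quadrant needs the sin polynomial for that lane and an
+# odd quadrant the cos polynomial, which is what the sixteen
+# .Levensin_oddcos_tbl entries enumerate. Roughly (sketch):
+#
+#   int idx = (npi2_0 & 1) | ((npi2_1 & 1) << 1)
+#           | ((npi2_2 & 1) << 2) | ((npi2_3 & 1) << 3);
+#   /* dispatch to the .L<...>_piby4 kernel for that sin/cos combination */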
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# rax, rcx, r8, r9 = |x| bit patterns of the four arguments
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm0, xmm12,
+#     xmm9, xmm11, xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0				# xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
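+
+# The huge-argument paths hand the reduction to __remainder_piby2d2f. From
+# the call sites in this file it appears to take the argument's double bit
+# pattern in rdi and to return the reduced value and the quadrant through
+# the rsi/rdx pointers; a sketch of the apparent convention (not a verified
+# declaration):
+#
+#   void __remainder_piby2d2f(unsigned long long x_bits, double *r, int *region);
+#
+# NaN/Inf arguments skip the call: OR-ing 0x0008000000000000 into the bit
+# pattern quiets a signaling NaN (and turns an infinity into a NaN, since
+# sin(Inf) is NaN), and the region is forced to 0.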
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+# mov .LQWORD,%rax PTR p_original[rsp]
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm0, xmm12,
+#         xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm0	# xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0				# xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	%eax,region(%rsp)			# store lower region
+
+	subsd	%xmm0,%xmm6				# xmm6 = r = (rhead-rtail)
+
+	movlpd	%xmm6,r(%rsp)				# store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm1,%xmm7 ; rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+# subpd %xmm1,%xmm7 ; rr=rhead-r
+# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr1[rsp], xmm7
+
+ jmp .L__vrs4_sinf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13,
+#         xmm5, xmm8, xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+	subsd	%xmm10,%xmm7				# xmm7 = r = (rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+# movsd %xmm7,%xmm1
+# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm1,%xmm7 ; rr=rhead-r
+# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail)
+
+	subsd	%xmm10,%xmm7				# xmm7 = r = (rhead-rtail)
+
+# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r
+# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr
+
+	movlpd	%xmm7,r1(%rsp)				# store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sinf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sinf_cleanup:
+
+ movapd p_sign(%rsp),%xmm10
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm10 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+ movlhps %xmm11,%xmm0
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x01E8,%rsp
+ ret
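+
+# Reconstruction/cleanup in brief: each kernel leaves the two double results
+# of a pair in xmm4/xmm5 with the sign not yet applied; the cleanup XORs in
+# the sign masks computed earlier and narrows back to four singles. Roughly,
+# in C (a sketch only):
+#
+#   /* res[4]: kernel outputs as doubles, sign_mask[4]: 0 or 1ull << 63 */
+#   for (int i = 0; i < 4; i++) {
+#       unsigned long long bits;
+#       memcpy(&bits, &res[i], 8);
+#       bits ^= sign_mask[i];
+#       memcpy(&res[i], &bits, 8);
+#       out[i] = (float)res[i];
+#   }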
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm0,%xmm4 # + t
+ subpd %xmm11,%xmm5 # + t
+
+ jmp .L__vrs4_sinf_cleanup
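+
+# This kernel is the all-cos case (every lane landed in an odd quadrant).
+# Note the sign trick: rather than forming t = 1 - 0.5*x2 and adding it, the
+# code keeps -t = 0.5*x2 - 1 and subtracts it at the end. In C terms (sketch):
+#
+#   double neg_t = 0.5 * r2 - 1.0;
+#   double zc    = (c1 + r2 * c2) + r4 * (c3 + r2 * c4);
+#   double cosr  = r4 * zc - neg_t;              /* == t + r4*zc */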
+
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm10,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+ movhlps %xmm5,%xmm9 # xmm4 = sin , xmm8 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ addsd %xmm10,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm13,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrs4_sinf_cleanup
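+
+# The mixed kernels (this one and those below) handle pairs where one double
+# lane needs sin and the other needs cos. The interleaved
+# .Lsincosarray/.Lcossinarray coefficients let the packed multiplies serve
+# both lanes at once; the movhlps/movlhps shuffles then split the lanes for
+# the steps that differ, e.g. per pair (sketch):
+#
+#   out_sin = r_s + (r_s * r2_s) * zs;           /* sin lane: x + x3*zs */
+#   out_cos = (1.0 - 0.5 * r2_c) + r4_c * zc;    /* cos lane: t + x4*zc */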
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4:
+
+ movapd .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm3,%xmm7 # sincos term upper x2 for x3
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm1,%xmm7
+
+ mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ addsd %xmm10,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm6 # move x2 for x4
+ movapd %xmm3,%xmm7 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s4+x2s3)
+ mulpd %xmm11,%xmm5 # x4(s4+x2s3)
+
+ mulpd %xmm10,%xmm6 # get low x3 for sin term
+ mulpd %xmm1,%xmm7 # get low x3 for sin term
+ movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term
+ movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm4,%xmm12 # xmm8 = sin , xmm4 = cos
+ movhlps %xmm5,%xmm13 # xmm9 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm12 # sin *x3
+ mulsd %xmm7,%xmm13 # sin *x3
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ movhlps %xmm10,%xmm0 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ addsd %xmm0,%xmm12 # sin + x
+ addsd %xmm11,%xmm13 # sin + x
+
+ subsd %xmm2,%xmm4 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm12,%xmm4
+ movlhps %xmm13,%xmm5
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos)
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm10,%xmm7
+
+ mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm3,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos)
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin)
+
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm6,%xmm5 # sin *x3
+ mulsd %xmm7,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+
+ movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos)
+
+ subsd %xmm2,%xmm4 # cos-(-t)
+ subsd %xmm12,%xmm9 # cos-(-t)
+
+ addsd %xmm11,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrs4_sinf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: SIN
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr: COS
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; SIN
+ movapd %xmm3,%xmm11 # x2 ; COS
+ movapd %xmm3,%xmm1 # copy of x2 for x4
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm0 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm3,%xmm1 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm1,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm1,%xmm5 # x4 * zc
+
+ addpd %xmm10,%xmm4 # +x
+ subpd %xmm11,%xmm5 # +t
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: COS
+# p_sign1 = Sign, xmm1 = r, xmm3 = r2, xmm7 = rr: SIN
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; COS
+ movapd %xmm3,%xmm11 # x2 ; SIN
+ movapd %xmm2,%xmm10 # copy of x2 for x4
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # s4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # s2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # s4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # s2*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # s1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm10,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zc
+
+ subpd %xmm0,%xmm4 # +t
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrs4_sinf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos
+ movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+
+ movapd %xmm12,%xmm2 # upper=x4
+ movsd %xmm6,%xmm2 # lower=x2
+ mulsd %xmm10,%xmm2 # lower=x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # upper= x4 * zc
+ # lower=x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+
+ movlhps %xmm7,%xmm10 #
+ addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrs4_sinf_cleanup
+.align 16
+.Lcoscos_sincos_piby4: #Derive from cossin_coscos
+ movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm10,%xmm2 # upper=x3 for sin
+ mulsd %xmm10,%xmm2 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # lower= x4 * zc
+ # upper= x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+
+ movsd %xmm7,%xmm10
+ addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrs4_sinf_cleanup
+.align 16
+.Lcossin_coscos_piby4:
+ movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd %xmm3,%xmm6 # lower x2 for x3 for sin
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1
+
+ movapd %xmm13,%xmm3 # upper=x4
+ movsd %xmm6,%xmm3 # lower x2
+ mulsd %xmm1,%xmm3 # lower x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # upper= x4 * zc
+ # lower=x3 * zs
+
+ movlhps %xmm7,%xmm1
+ addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm4 # -(-t)
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm3,%xmm0 # x2
+ movapd %xmm3,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ movapd %xmm13,%xmm3 # upper x4 for cos
+ movsd %xmm7,%xmm3 # lower x2 for sin
+ mulsd %xmm1,%xmm3 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +t upper, +x lower
+
+
+ jmp .L__vrs4_sinf_cleanup
+.align 16
+.Lsincos_coscos_piby4:
+ movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm1,%xmm3 # upper=x3 for sin
+ mulsd %xmm1,%xmm3 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower= x4 * zc
+ # upper= x3 * zs
+
+ movsd %xmm7,%xmm1
+ subpd %xmm11,%xmm4 # -(-t)
+ addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos
+
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+
+ movsd %xmm3,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm1,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11 # upper =t ; lower =x
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm11,%xmm5 # +t lower, +x upper
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ movapd %xmm12,%xmm2 # upper x4 for cos
+ movsd %xmm7,%xmm2 # lower x2 for sin
+ mulsd %xmm10,%xmm2 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm10,%xmm4 # +t upper, +x lower
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movsd %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm10,%xmm2 # upper x3 for sin
+ mulsd %xmm10,%xmm2 # lower x4 for cos
+
+ movhlps %xmm10,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm11,%xmm4 # +t lower, +x upper
+
+ jmp .L__vrs4_sinf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ #x2 = x * x;
+ #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+
+ #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4));
+
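+	# Illustrative C sketch (not part of the original source): the two
+	# polynomial forms in the comments above are algebraically identical;
+	# the second grouping is the one computed below, since the two inner
+	# sums can be formed independently and combined with x4.  c1..c4 stand
+	# for the packed constants in .Lsinarray.
+	#
+	#   static float sin_poly(float x, float c1, float c2,
+	#                         float c3, float c4)
+	#   {
+	#       float x2 = x * x;
+	#       float x3 = x2 * x;
+	#       float x4 = x2 * x2;
+	#       return x + x3 * ((c1 + x2 * c2) + x4 * (c3 + x2 * c4));
+	#   }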
+
+ movapd %xmm2,%xmm0 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # x3
+
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrs4_sinf_cleanup
diff --git a/src/gas/vrs8expf.S b/src/gas/vrs8expf.S
new file mode 100644
index 0000000..b2eb597
--- /dev/null
+++ b/src/gas/vrs8expf.S
@@ -0,0 +1,618 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8expf.s
+#
+# A vector implementation of the expf libm function.
+#
+# Prototype:
+#
+#      __m128,__m128 __vrs8_expf(__m128 x1, __m128 x2);
+#
+# Computes e raised to the x power for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
+#
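+#
+# Illustrative C sketch of the scalar algorithm the vector code below
+# follows (an approximation for exposition only, not part of the original
+# source; two_to_jby32[] stands for .L__two_to_jby32_table at the end of
+# this file, and the real code carries the reduction constant as a
+# head/tail pair and clamps huge inputs to +-8192 first):
+#
+#   #include <math.h>
+#   extern const float two_to_jby32[32];        /* 2**(j/32), j = 0..31 */
+#
+#   float expf_sketch(float x)                  /* x finite, moderate   */
+#   {
+#       const float thirtytwo_by_log2 = 4.6166241e+01f;  /* 32/ln(2)   */
+#       const float log2_by_32        = 2.1660849e-02f;  /* ln(2)/32   */
+#       int   n = (int)nearbyintf(x * thirtytwo_by_log2);
+#       int   j = n & 0x1f;                     /* table index          */
+#       int   m = (n - j) / 32;                 /* power-of-two scale   */
+#       float r = x - n * log2_by_32;           /* reduced argument     */
+#       float q = r + r*r*(0.5f + r*(1.0f/6.0f + r*(1.0f/24.0f)));
+#       return ldexpf(two_to_jby32[j] * (1.0f + q), m);
+#   }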
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_ux,0x00 #qword
+.equ p_ux2,0x010 #qword
+
+.equ save_xa,0x020 #qword
+.equ save_ya,0x028 #qword
+.equ save_nv,0x030 #qword
+
+
+.equ p_iter,0x038 # qword storage for number of loop iterations
+
+.equ p_j,0x040 # second temporary for get/put bits operation
+.equ p_m,0x050 #qword
+.equ p_j2,0x060 # second temporary for exponent multiply
+.equ p_m2,0x070 #qword
+.equ save_rbx,0x080 #qword
+
+
+.equ stack_size,0x098
+
+
+# parameters passed by gcc as:
+#  xmm0 - __m128 x1
+#  xmm1 - __m128 x2
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrs8_expf
+ .type __vrs8_expf,@function
+__vrs8_expf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+
+# Process the array 8 values at a time.
+
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+
+ movaps %xmm0,p_ux(%rsp)
+ maxps .L__real_m8192(%rip),%xmm0
+ movaps %xmm1,p_ux2(%rsp)
+ maxps .L__real_m8192(%rip),%xmm1
+ movaps %xmm1,%xmm6
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm2 #
+
+ mulps %xmm0,%xmm2
+ xor %rax,%rax
+ minps .L__real_8192(%rip),%xmm2
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm5 #
+
+ mulps %xmm6,%xmm5
+ minps .L__real_8192(%rip),%xmm5 # protect against large input values
+
+# /* Set n = nearest integer to r */
+ cvtps2dq %xmm2,%xmm3
+ lea .L__two_to_jby32_table(%rip),%rdi
+ cvtdq2ps %xmm3,%xmm1
+
+ cvtps2dq %xmm5,%xmm8
+ cvtdq2ps %xmm8,%xmm7
+# r1 = x - n * logbaseof2_by_32_lead;
+ movaps .L__real_log2_by_32_head(%rip),%xmm2
+ mulps %xmm1,%xmm2
+ subps %xmm2,%xmm0 # r1 in xmm0,
+
+ movaps .L__real_log2_by_32_head(%rip),%xmm5
+ mulps %xmm7,%xmm5
+ subps %xmm5,%xmm6 # r1 in xmm6,
+
+
+# r2 = - n * logbaseof2_by_32_tail;
+ mulps .L__real_log2_by_32_tail(%rip),%xmm1
+ mulps .L__real_log2_by_32_tail(%rip),%xmm7
+
+# j = n & 0x0000001f;
+ movdqa %xmm3,%xmm4
+ movdqa .L__int_mask_1f(%rip),%xmm2
+ movdqa %xmm8,%xmm9
+ movdqa .L__int_mask_1f(%rip),%xmm5
+ pand %xmm4,%xmm2
+ movdqa %xmm2,p_j(%rsp)
+# f1 = two_to_jby32_lead_table[j];
+
+ pand %xmm9,%xmm5
+ movdqa %xmm5,p_j2(%rsp)
+
+# *m = (n - j) / 32;
+ psubd %xmm2,%xmm4
+ psrad $5,%xmm4
+ movdqa %xmm4,p_m(%rsp)
+ psubd %xmm5,%xmm9
+ psrad $5,%xmm9
+ movdqa %xmm9,p_m2(%rsp)
+
+ movaps %xmm0,%xmm3
+ addps %xmm1,%xmm3 # r = r1+ r2
+
+ mov p_j(%rsp),%eax # get an individual index
+
+ movaps %xmm6,%xmm8
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ addps %xmm7,%xmm8 # r = r1+ r2
+ mov %edx,p_j(%rsp) # save the f1 value
+
+
+# Step 2. Compute the polynomial.
+# q = r1 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 )))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+# q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
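+#	(A rough bound, added for exposition and not from the original
+#	source: after reduction |r| <= log(2)/64 ~= 0.0108, so the first
+#	dropped term is |r|^5/120 ~= 1.2e-12, far below the single-precision
+#	ulp of ~6e-8, which is why the degree-4 truncation suffices.)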
+ movaps %xmm3,%xmm4
+ movaps %xmm3,%xmm2
+ mulps %xmm2,%xmm2 # x*x
+ mulps .L__real_1_24(%rip),%xmm4 # /24
+
+ mov p_j+4(%rsp),%eax # get an individual index
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+4(%rsp) # save the f1 value
+
+
+ addps .L__real_1_6(%rip),%xmm4 # +1/6
+
+ mulps %xmm2,%xmm3 # x^3
+ mov p_j+8(%rsp),%eax # get an individual index
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+8(%rsp) # save the f1 value
+
+ mulps .L__real_half(%rip),%xmm2 # x^2/2
+ mov p_j+12(%rsp),%eax # get an individual index
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+12(%rsp) # save the f1 value
+
+ mulps %xmm3,%xmm4 # *x^3
+ mov p_j2(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2(%rsp) # save the f1 value
+
+
+ addps %xmm4,%xmm1 # +r2
+
+ addps %xmm2,%xmm1 # + x^2/2
+ addps %xmm1,%xmm0 # +r1
+
+ movaps %xmm8,%xmm9
+ mov p_j2+4(%rsp),%eax # get an individual index
+ movaps %xmm8,%xmm5
+ mulps %xmm5,%xmm5 # x*x
+ mulps .L__real_1_24(%rip),%xmm9 # /24
+
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+4(%rsp) # save the f1 value
+
+
+# deal with infinite or denormal results
+ movdqa p_m(%rsp),%xmm1
+ movdqa p_m(%rsp),%xmm2
+ pcmpgtd .L__int_127(%rip),%xmm2
+ pminsw .L__int_128(%rip),%xmm1 # ceil at 128
+ movmskps %xmm2,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm1 # add bias
+
+# *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j(%rsp),%xmm0 # * f1
+ addps p_j(%rsp),%xmm0 # + f1
+ jnz .L__exp_largef
+.L__check1:
+
+ pxor %xmm2,%xmm2 # floor at 0
+ pmaxsw %xmm2,%xmm1
+
+ pslld $23,%xmm1 # build 2^n
+
+ movaps %xmm1,%xmm2
+
+
+
+# check for infinity or nan
+ movaps p_ux(%rsp),%xmm1
+ andps .L__real_infinity(%rip),%xmm1
+ cmpps $0,.L__real_infinity(%rip),%xmm1
+ movmskps %xmm1,%ebx
+ test $0x0f,%ebx
+
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# Step 3. Reconstitute.
+
+ mulps %xmm2,%xmm0 # result *= 2^n
+
+# we'd like to avoid a branch, and can use cmp's and and's to
+# eliminate them. But it adds cycles for normal cases
+# to handle events that are supposed to be exceptions.
+# Using this branch with the
+# check above results in faster code for the normal cases.
+# And branch mispredict penalties should only come into
+# play for nans and infinities.
+ jnz .L__exp_naninf
+.L__vsa_bottom1:
+
+ # q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ addps .L__real_1_6(%rip),%xmm9 # +1/6
+
+ mulps %xmm5,%xmm8 # x^3
+ mov p_j2+8(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+8(%rsp) # save the f1 value
+
+ mulps .L__real_half(%rip),%xmm5 # x^2/2
+ mulps %xmm8,%xmm9 # *x^3
+
+ mov p_j2+12(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+12(%rsp) # save the f1 value
+
+ addps %xmm9,%xmm7 # +r2
+
+ addps %xmm5,%xmm7 # + x^2/2
+ addps %xmm7,%xmm6 # +r1
+
+
+ # deal with infinite or denormal results
+ movdqa p_m2(%rsp),%xmm7
+ movdqa p_m2(%rsp),%xmm5
+ pcmpgtd .L__int_127(%rip),%xmm5
+ pminsw .L__int_128(%rip),%xmm7 # ceil at 128
+ movmskps %xmm5,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm7 # add bias
+
+ # *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j2(%rsp),%xmm6 # * f1
+ addps p_j2(%rsp),%xmm6 # + f1
+ jnz .L__exp_largef2
+.L__check2:
+ pxor %xmm1,%xmm1 # floor at 0
+ pmaxsw %xmm1,%xmm7
+
+ pslld $23,%xmm7 # build 2^n
+
+ movaps %xmm7,%xmm1
+
+
+ # check for infinity or nan
+ movaps p_ux2(%rsp),%xmm7
+ andps .L__real_infinity(%rip),%xmm7
+ cmpps $0,.L__real_infinity(%rip),%xmm7
+ movmskps %xmm7,%ebx
+ test $0x0f,%ebx
+
+
+ # end of splitexp
+ # /* Scale (z1 + z2) by 2.0**m */
+ # Step 3. Reconstitute.
+
+ mulps %xmm6,%xmm1 # result *= 2^n
+
+ jnz .L__exp_naninf2
+
+.L__vsa_bottom2:
+
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+ lea p_ux(%rsp),%rcx
+ lea p_j(%rsp),%rsi
+ call .L__fexp_naninf
+ jmp .L__vsa_bottom1
+.L__exp_naninf2:
+ lea p_ux2(%rsp),%rcx
+ lea p_j(%rsp),%rsi
+ movaps %xmm0,%xmm2
+ movaps %xmm1,%xmm0
+ call .L__fexp_naninf
+ movaps %xmm0,%xmm1
+ movaps %xmm2,%xmm0
+ jmp .L__vsa_bottom2
+
+# deal with nans and infinities
+# This subroutine checks a packed single for nans and infinities and
+# produces the proper result from the exceptional inputs
+# Register assumptions:
+# Inputs:
+# rbx - mask of errors
+# xmm0 - computed result vector
+# Outputs:
+# xmm0 - new result vector
+# %rax,rdx,rbx,%xmm2 all modified.
+
+.L__fexp_naninf:
+ sub $0x018,%rsp
+ movaps %xmm0,(%rsi) # save the computed values
+ test $1,%ebx # first value?
+ jz .L__Lni2
+ mov 0(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,0(%rsi) # copy the result
+.L__Lni2:
+ test $2,%ebx # second value?
+ jz .L__Lni3
+ mov 4(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,4(%rsi) # copy the result
+.L__Lni3:
+ test $4,%ebx # third value?
+ jz .L__Lni4
+ mov 8(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,8(%rsi) # copy the result
+.L__Lni4:
+ test $8,%ebx # fourth value?
+ jz .L__Lnie
+ mov 12(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,12(%rsi) # copy the result
+.L__Lnie:
+ movaps (%rsi),%xmm0 # get the answers
+ add $0x018,%rsp
+ ret
+
+#
+# a simple subroutine to check a scalar input value for infinity
+# or NaN and return the correct result
+# expects its input in %edx and returns the result in %edx. Destroys %eax.
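+#
+# Illustrative C equivalent (exposition only, not part of the original
+# source).  'bits' is the raw IEEE-754 encoding of the input; the caller
+# only reaches this helper when the exponent field is all ones:
+#
+#   static unsigned int expf_naninf(unsigned int bits)
+#   {
+#       if (bits & 0x007fffffu)            /* mantissa != 0 -> NaN      */
+#           return bits | 0x00400000u;     /* quiet it and hand it back */
+#       if (bits & 0x80000000u)            /* -inf: exp(-inf) = +0      */
+#           return 0u;
+#       return bits;                       /* +inf: exp(+inf) = +inf    */
+#   }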
+.L__naninf:
+ mov $0x0007FFFFF,%eax
+ test %eax,%edx
+ jnz .L__enan # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %edx,%eax
+ rcl $1,%eax
+ jnc .L__r # exp(+inf) = inf
+ xor %edx,%edx # exp(-inf) = 0
+ jmp .L__r
+
+#NaN
+.L__enan:
+ mov $0x000400000,%eax # convert to quiet
+ or %eax,%edx
+.L__r:
+ ret
+ .align 16
+# deal with m > 127. In some instances, rounding during calculations
+# can result in infinity when it shouldn't. For these cases, we scale
+# m down, and scale the mantissa up.
+
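+# Illustrative sketch of the rescue (exposition only, not part of the
+# original source): the product z * 2**m is kept unchanged by trading one
+# power of two from the exponent into the mantissa, so the biased exponent
+# stays representable and the result only becomes +inf when it truly
+# overflows:
+#
+#   #include <math.h>
+#   static float scale_2_to_m(float z, int m)
+#   {
+#       if (m > 127) { m -= 1; z *= 2.0f; }   /* keep exponent in range */
+#       return ldexpf(z, m);
+#   }
+#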
+.L__exp_largef:
+ movdqa %xmm0,p_j(%rsp) # save the mantissa portion
+ movdqa %xmm1,p_m(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf2
+ mov p_m(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m(%rsp) # save the exponent
+ movss p_j(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j(%rsp) # save the mantissa
+.L__Lf2:
+ test $2,%ecx # second value?
+ jz .L__Lf3
+ mov p_m+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+4(%rsp) # save the exponent
+ movss p_j+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+4(%rsp) # save the mantissa
+.L__Lf3:
+ test $4,%ecx # third value?
+ jz .L__Lf4
+ mov p_m+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+8(%rsp) # save the exponent
+ movss p_j+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+8(%rsp) # save the mantissa
+.L__Lf4:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe
+ mov p_m+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+12(%rsp) # save the exponent
+ movss p_j+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+12(%rsp) # save the mantissa
+.L__Lfe:
+ movaps p_j(%rsp),%xmm0 # restore the mantissa portion back
+ movdqa p_m(%rsp),%xmm1 # restore the exponent portion
+ jmp .L__check1
+
+ .align 16
+
+.L__exp_largef2:
+ movdqa %xmm6,p_j(%rsp) # save the mantissa portion
+ movdqa %xmm7,p_m2(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf22
+ mov p_m2+0(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+0(%rsp) # save the exponent
+ movss p_j+0(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+0(%rsp) # save the mantissa
+.L__Lf22:
+ test $2,%ecx # second value?
+ jz .L__Lf32
+ mov p_m2+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+4(%rsp) # save the exponent
+ movss p_j+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+4(%rsp) # save the mantissa
+.L__Lf32:
+ test $4,%ecx # third value?
+ jz .L__Lf42
+ mov p_m2+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+8(%rsp) # save the exponent
+ movss p_j+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+8(%rsp) # save the mantissa
+.L__Lf42:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe2
+ mov p_m2+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+12(%rsp) # save the exponent
+ movss p_j+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+12(%rsp) # save the mantissa
+.L__Lfe2:
+ movaps p_j(%rsp),%xmm6 # restore the mantissa portion back
+ movdqa p_m2(%rsp),%xmm7 # restore the exponent portion
+ jmp .L__check2
+
+ .data # MUCH better performance without this on my tests
+ .align 64
+.L__real_half: .long 0x03f000000 # 1/2
+ .long 0x03f000000
+ .long 0x03f000000
+ .long 0x03f000000
+.L__real_two: .long 0x40000000 # 2
+ .long 0x40000000
+ .long 0x40000000
+ .long 0x40000000
+
+.L__real_8192: .long 0x46000000 # 8192, to protect against really large numbers
+ .long 0x46000000
+ .long 0x46000000
+ .long 0x46000000
+.L__real_m8192: .long 0xC6000000 # -8192, to protect against really small numbers
+ .long 0xC6000000
+ .long 0xC6000000
+ .long 0xC6000000
+.L__real_thirtytwo_by_log2: .long 0x04238AA3B # thirtytwo_by_log2
+ .long 0x04238AA3B
+ .long 0x04238AA3B
+ .long 0x04238AA3B
+.L__real_log2_by_32: .long 0x03CB17218 # log2_by_32
+ .long 0x03CB17218
+ .long 0x03CB17218
+ .long 0x03CB17218
+.L__real_log2_by_32_head: .long 0x03CB17000 # log2_by_32
+ .long 0x03CB17000
+ .long 0x03CB17000
+ .long 0x03CB17000
+.L__real_log2_by_32_tail: .long 0x0B585FDF4 # log2_by_32
+ .long 0x0B585FDF4
+ .long 0x0B585FDF4
+ .long 0x0B585FDF4
+.L__real_1_6: .long 0x03E2AAAAB # 0.16666666666 used in polynomial
+ .long 0x03E2AAAAB
+ .long 0x03E2AAAAB
+ .long 0x03E2AAAAB
+.L__real_1_24: .long 0x03D2AAAAB # 0.041666668 used in polynomial
+ .long 0x03D2AAAAB
+ .long 0x03D2AAAAB
+ .long 0x03D2AAAAB
+.L__real_1_120: .long 0x03C088889 # 0.0083333338 used in polynomial
+ .long 0x03C088889
+ .long 0x03C088889
+ .long 0x03C088889
+.L__real_infinity: .long 0x07f800000 # infinity
+ .long 0x07f800000
+ .long 0x07f800000
+ .long 0x07f800000
+.L__int_mask_1f: .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+.L__int_128: .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+.L__int_127: .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+
+.L__two_to_jby32_table:
+ .long 0x03F800000 # 1.0000000000000000
+ .long 0x03F82CD87 # 1.0218971486541166
+ .long 0x03F85AAC3 # 1.0442737824274138
+ .long 0x03F88980F # 1.0671404006768237
+ .long 0x03F8B95C2 # 1.0905077326652577
+ .long 0x03F8EA43A # 1.1143867425958924
+ .long 0x03F91C3D3 # 1.1387886347566916
+ .long 0x03F94F4F0 # 1.1637248587775775
+ .long 0x03F9837F0 # 1.1892071150027210
+ .long 0x03F9B8D3A # 1.2152473599804690
+ .long 0x03F9EF532 # 1.2418578120734840
+ .long 0x03FA27043 # 1.2690509571917332
+ .long 0x03FA5FED7 # 1.2968395546510096
+ .long 0x03FA9A15B # 1.3252366431597413
+ .long 0x03FAD583F # 1.3542555469368927
+ .long 0x03FB123F6 # 1.3839098819638320
+ .long 0x03FB504F3 # 1.4142135623730951
+ .long 0x03FB8FBAF # 1.4451808069770467
+ .long 0x03FBD08A4 # 1.4768261459394993
+ .long 0x03FC12C4D # 1.5091644275934228
+ .long 0x03FC5672A # 1.5422108254079407
+ .long 0x03FC9B9BE # 1.5759808451078865
+ .long 0x03FCE248C # 1.6104903319492543
+ .long 0x03FD2A81E # 1.6457554781539649
+ .long 0x03FD744FD # 1.6817928305074290
+ .long 0x03FDBFBB8 # 1.7186192981224779
+ .long 0x03FE0CCDF # 1.7562521603732995
+ .long 0x03FE5B907 # 1.7947090750031072
+ .long 0x03FEAC0C7 # 1.8340080864093424
+ .long 0x03FEFE4BA # 1.8741676341103000
+ .long 0x03FF5257D # 1.9152065613971474
+ .long 0x03FFA83B3 # 1.9571441241754002
+ .long 0 # for alignment
+
diff --git a/src/gas/vrs8log10f.S b/src/gas/vrs8log10f.S
new file mode 100644
index 0000000..b0a2a67
--- /dev/null
+++ b/src/gas/vrs8log10f.S
@@ -0,0 +1,967 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8log10f.s
+#
+# A vector implementation of the log10f libm function.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128,__m128 __vrs8_log10f(__m128 x1, __m128 x2);
+#
+# Computes the base-10 logarithm of x for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
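+#
+# Illustrative C sketch of the scalar algorithm this file vectorizes
+# (an approximation for exposition only, not part of the original source;
+# ln_lead[]/ln_tail[] stand for the .L__np_ln_lead_table/.L__np_ln_tail_table
+# data below, with ln_lead[k] + ln_tail[k] ~= ln((k + 64) / 64.0), and the
+# real code also splits log2 and log10(e) into lead/tail parts and handles
+# zero, negative, inf and NaN inputs separately):
+#
+#   #include <math.h>
+#   extern const float ln_lead[65], ln_tail[65];
+#
+#   float log10f_sketch(float x)            /* assumes x finite and > 0 */
+#   {
+#       const float cb1 = 8.3333333e-02f, cb2 = 1.2500000e-02f,
+#                   cb3 = 2.2321981e-03f;   /* .L__real_cb1..cb3        */
+#       int   xexp;
+#       float f  = frexpf(x, &xexp);        /* x = 2**xexp * f, f in [0.5,1) */
+#       int   j  = (int)(f * 128.0f + 0.5f);     /* 64 <= j <= 128      */
+#       float f1 = j * (1.0f / 128.0f);
+#       float f2 = f - f1;
+#       float u  = f2 / (f1 + 0.5f * f2);        /* ln(f/f1) ~ poly(u)  */
+#       float u2 = u * u;
+#       float poly = u + u * u2 * (cb1 + u2 * (cb2 + u2 * cb3));
+#       float lnx  = (xexp - 1) * 0.69314718f
+#                  + ln_lead[j - 64] + ln_tail[j - 64] + poly;
+#       return lnx * 0.43429448f;                /* * log10(e)          */
+#   }
+#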
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+.equ save_xmm7,0x080 #
+.equ save_xmm8,0x090 #
+.equ save_xmm9,0x0a0 #
+.equ save_xmm10,0x0b0 #
+.equ save_xmm11,0x0c0 #
+.equ save_xmm12,0x0d0 #
+.equ save_xmm13,0x0d0 #
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ stack_size,0x0168
+
+
+
+.globl __vrs8_log10f
+ .type __vrs8_log10f,@function
+__vrs8_log10f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm1,p_x2(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+ movdqa %xmm1,%xmm12
+ movdqa %xmm1,%xmm9
+ movaps %xmm1,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ movaps %xmm0,%xmm2
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ movaps %xmm1,%xmm3
+
+# logef to log10f
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_tail(%rip),%xmm0
+ mulps .L__real_log10e_lead(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm2
+ addps %xmm1,%xmm0
+ addps %xmm3,%xmm0
+ addps %xmm2,%xmm0
+# addps %xmm1,%xmm0
+
+
+
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 < x in every lane?  NaN lanes also clear the mask.
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps p_z12(%rsp),%xmm1 # z1 values
+
+ mulps %xmm13,%xmm8
+ addps %xmm8,%xmm1 #r1
+ movaps %xmm1,%xmm8
+ mulps .L__real_log2_tail(%rip),%xmm13
+ addps %xmm13,%xmm7 #r2
+ movaps %xmm7,%xmm9
+
+ # logef to log10f
+ mulps .L__real_log10e_tail(%rip),%xmm7
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm9
+ mulps .L__real_log10e_lead(%rip),%xmm8
+ addps %xmm7,%xmm1
+ addps %xmm9,%xmm1
+ addps %xmm8,%xmm1
+
+# addps %xmm7,%xmm1
+
+ # check e as a special case
+# movaps p_x2(%rsp),%xmm10
+# cmpps $0,.L__real_ef(%rip),%xmm10
+# movmskps %xmm10,%r9d
+ # check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7	# 0 < x in every lane?  NaN lanes also clear the mask.
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# loge to log10
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ #loge to log10
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm7
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm7
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm7,%xmm3
+ addps %xmm5,%xmm3
+
+ # return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
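+#
+# Illustrative C view of the special cases produced below (exposition only,
+# not part of the original source; mirrors the C99 results):
+#
+#   #include <math.h>
+#   static float log10f_special(float x)
+#   {
+#       if (x != x)    return x + x;       /* NaN in -> quiet NaN out  */
+#       if (x == 0.0f) return -INFINITY;   /* log10(+-0) = -inf        */
+#       if (x <  0.0f) return NAN;         /* negative argument -> NaN */
+#       return x;                          /* not reached on this path */
+#   }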
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two:				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+.L__real_log10e_lead: .quad 0x03EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x03EDE00003EDE0000
+.L__real_log10e_tail: .quad 0x03A37B1523A37B152 # log10e_tail 0.0007007319
+ .quad 0x03A37B1523A37B152
+
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
+
diff --git a/src/gas/vrs8log2f.S b/src/gas/vrs8log2f.S
new file mode 100644
index 0000000..d1028b0
--- /dev/null
+++ b/src/gas/vrs8log2f.S
@@ -0,0 +1,956 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8log2f.s
+#
+# A vector implementation of the log2f libm function.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double precision version, but it will
+# be better for vectorizing.
+#
+# Prototype:
+#
+# __m128,__m128 __vrs8_log2f(__m128 x1, __m128 x2);
+#
+# Computes the base-2 logarithm of x for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
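+#
+# Illustrative sketch of the lead/tail multiply used below (exposition
+# only, not part of the original source).  log2(x) = xexp + ln(f)*log2(e);
+# multiplying by a single rounded log2(e) would lose low-order bits, so the
+# constant is split into .L__real_log2e_lead (upper bits) and
+# .L__real_log2e_tail (the remainder), and the small products are summed
+# before the large one:
+#
+#   /* z1 + z2 ~= ln(f) from the table lookup plus polynomial; xexp is the
+#      integer exponent of x.                                             */
+#   float r1     = z1 * log2e_lead + xexp;
+#   float r2     = (z1 + z2) * log2e_tail + z2 * log2e_lead;
+#   float result = r1 + r2;
+#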
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+.equ save_xmm7,0x080 #
+.equ save_xmm8,0x090 #
+.equ save_xmm9,0x0a0 #
+.equ save_xmm10,0x0b0 #
+.equ save_xmm11,0x0c0 #
+.equ save_xmm12,0x0d0 #
+.equ save_xmm13,0x0d0 #
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ stack_size,0x0168
+
+
+
+.globl __vrs8_log2f
+ .type __vrs8_log2f,@function
+__vrs8_log2f:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm1,p_x2(%rsp) # save x
+# movdqa %xmm0,%xmm2
+# cmpps $0,.L__real_ef(%rip),%xmm2
+# movmskps %xmm2,%r9d
+
+ movdqa %xmm1,%xmm12
+ movdqa %xmm1,%xmm9
+ movaps %xmm1,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2e_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm3
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+ movaps %xmm1,%xmm4 #z2 copy
+ movaps p_z1(%rsp),%xmm0 # z1 values
+ movaps %xmm0,%xmm5 #z1 copy
+
+ mulps %xmm2,%xmm5 #z1*log2e_lead
+ mulps %xmm2,%xmm1 #z2*log2e_lead
+ mulps %xmm3,%xmm4 #z2*log2e_tail
+ mulps %xmm3,%xmm0 #z1*log2e_tail
+ addps %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addps %xmm4,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm1,%xmm0 #r2
+#return r1+r2
+ addps %xmm5,%xmm0 # r1+ r2
+
+
+
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 < x in every lane?  NaN lanes also clear the mask.
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2e_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm9
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps %xmm7,%xmm10 #z2 copy
+ movaps p_z12(%rsp),%xmm1 # z1 values
+ movaps %xmm1,%xmm11 #z1 copy
+
+ mulps %xmm8,%xmm11 #z1*log2e_lead
+ mulps %xmm8,%xmm7 #z2*log2e_lead
+ mulps %xmm9,%xmm10 #z2*log2e_tail
+ mulps %xmm9,%xmm1 #z1*log2e_tail
+ addps %xmm13,%xmm11 #r1 = z1*log2e_lead + xexp
+ addps %xmm10,%xmm1 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm7,%xmm1 #r2
+ #return r1+r2
+ addps %xmm11,%xmm1 # r1+ r2
+
+ # check e as a special case
+# movaps p_x2(%rsp),%xmm10
+# cmpps $0,.L__real_ef(%rip),%xmm10
+# movmskps %xmm10,%r9d
+ # check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7	# mask: x > 0?  lanes with x <= 0 or NaN fail and take the error path
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# loge to log2
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm1
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
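+# Net effect (informal): the near-one natural-log result r + r2 is converted
+# to base 2 as (r1 + r2')*log2(e), where r1 is r with its low 16 bits masked
+# off and r2' = r2 + (r - r1); splitting both factors keeps r1*log2e_lead
+# essentially exact in single precision.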
+
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# loge to log2
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm7
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm7
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm7,%xmm3
+ addps %xmm5,%xmm3
+
+ # return r + r2;
+ # addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+.L__real_log2e_lead: .quad 0x03FB800003FB80000 #1.4375000000
+ .quad 0x03FB800003FB80000
+.L__real_log2e_tail: .quad 0x03BAA3B293BAA3B29 # 0.0051950408889633
+ .quad 0x03BAA3B293BAA3B29
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
+
diff --git a/src/gas/vrs8logf.S b/src/gas/vrs8logf.S
new file mode 100644
index 0000000..a5e7ed9
--- /dev/null
+++ b/src/gas/vrs8logf.S
@@ -0,0 +1,904 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrs8logf.s
+#
+# A vector implementation of the logf libm function.
+#   This routine is implemented in single precision. It is slightly
+#   less accurate than the double-precision version, but it is better
+#   suited to vectorization.
+#
+# Prototype:
+#
+# __m128,__m128 __vrs8_logf(__m128 x1, __m128 x2);
+#
+# Computes the natural log of x for eight packed single values.
+# Places the results into xmm0 and xmm1.
+# Returns proper C99 values, but may not raise status flags properly.
+# Less than 1 ulp of error.
+#
+# This array version is basically an unrolling of the by-4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
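+# Outline of the per-lane computation (informal C-style sketch; the names
+# f, f1, f2, u, z1, z2 follow the inline comments below):
+#   split x into an exponent xexp and a reduced mantissa f (see below)
+#   f1   = index/128                    /* index rounded from the top mantissa bits */
+#   f2   = f - f1
+#   u    = f2 / (f1 + 0.5*f2)
+#   poly = u + A*u^3 + u^5*(B + C*u^2)  /* series approximating ln(f/f1) */
+#   z1   = ln_lead_table[index]
+#   z2   = poly + ln_tail_table[index]
+#   logf(x) ~= (z1 + xexp*log2_lead) + (z2 + xexp*log2_tail)
+#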
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+.equ save_xmm6,0x060 #
+.equ save_rbx,0x070 #
+.equ save_xmm7,0x080 #
+.equ save_xmm8,0x090 #
+.equ save_xmm9,0x0a0 #
+.equ save_xmm10,0x0b0 #
+.equ save_xmm11,0x0c0 #
+.equ save_xmm12,0x0d0 #
+.equ save_xmm13,0x0d0 #
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ stack_size,0x0168
+
+
+
+.globl __vrs8_logf
+ .type __vrs8_logf,@function
+__vrs8_logf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm1,p_x2(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+ movdqa %xmm1,%xmm12
+ movdqa %xmm1,%xmm9
+ movaps %xmm1,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
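+# The table lookups below are a manual gather: p_idx holds four 16-bit indexes
+# packed into one qword (via packssdw above); each index is peeled off with
+# mov %cx,%r8w / ror $16,%rcx, the 32-bit table entry is loaded from
+# -256(%rdx,%r8,4), and pairs of entries are packed back into 64-bit stores.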
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ addps %xmm1,%xmm0
+
+
+
+# check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# mask: x > 0?  lanes with x <= 0 or NaN fail and take the error path
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps p_z12(%rsp),%xmm1 # z1 values
+
+ mulps %xmm13,%xmm8
+ addps %xmm8,%xmm1 #r1
+ mulps .L__real_log2_tail(%rip),%xmm13
+ addps %xmm13,%xmm7 #r2
+ addps %xmm7,%xmm1
+
+ # check e as a special case
+ movaps p_x2(%rsp),%xmm10
+ cmpps $0,.L__real_ef(%rip),%xmm10
+ movmskps %xmm10,%r9d
+ # check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7	# mask: x > 0?  lanes with x <= 0 or NaN fail and take the error path
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+.L__finish:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ # return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
+
diff --git a/src/gas/vrsacosf.S b/src/gas/vrsacosf.S
new file mode 100644
index 0000000..1620009
--- /dev/null
+++ b/src/gas/vrsacosf.S
@@ -0,0 +1,2291 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsacosf.s
+#
+# A vector implementation of the cos libm function.
+#
+# Prototype:
+#
+# vrsa_cosf(int n, float* x, float* y);
+#
+# Computes Cosine of x for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This inlines a routine that computes 4 single precision Cosine values at a time.
+# The four values are passed as packed single in xmm10.
+# The four results are returned as packed singles in xmm10.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning 2 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
+
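+# Illustrative usage (informal sketch):
+#   float x[n], y[n];
+#   vrsa_cosf(n, x, y);      /* afterwards y[i] ~= cosf(x[i]) for 0 <= i < n */
+#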
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+
+.align 8
+ .Levencos_oddsin_tbl:
+ .quad .Lcoscos_coscos_piby4 # 0 * ; Done
+ .quad .Lcoscos_cossin_piby4 # 1 + ; Done
+ .quad .Lcoscos_sincos_piby4 # 2 ; Done
+ .quad .Lcoscos_sinsin_piby4 # 3 + ; Done
+
+ .quad .Lcossin_coscos_piby4 # 4 ; Done
+ .quad .Lcossin_cossin_piby4 # 5 * ; Done
+ .quad .Lcossin_sincos_piby4 # 6 ; Done
+ .quad .Lcossin_sinsin_piby4 # 7 ; Done
+
+ .quad .Lsincos_coscos_piby4 # 8 ; Done
+ .quad .Lsincos_cossin_piby4 # 9 ; TBD
+ .quad .Lsincos_sincos_piby4 # 10 * ; Done
+ .quad .Lsincos_sinsin_piby4 # 11 ; Done
+
+ .quad .Lsinsin_coscos_piby4 # 12 ; Done
+ .quad .Lsinsin_cossin_piby4 # 13 + ; Done
+ .quad .Lsinsin_sincos_piby4 # 14 ; Done
+ .quad .Lsinsin_sinsin_piby4 # 15 * ; Done
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrsa_cosf_
+ .set vrsa_cosf_,__vrsa_cosf__
+ .weak vrsa_cosf__
+ .set vrsa_cosf__,__vrsa_cosf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#FORTRAN subroutine implementation of array cos
+#VRSA_COSF(N,X,Y)
+#C equivalent*/
+#void vrsa_cosf__(int *n, float *x, float *y)
+#{
+# vrsa_cosf(*n,x,y);
+#}
+
+.globl __vrsa_cosf__
+ .type __vrsa_cosf__,@function
+__vrsa_cosf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # temporary for get/put bits operation
+.equ save_xmm7,0x30 # temporary for get/put bits operation
+.equ save_xmm8,0x40 # temporary for get/put bits operation
+.equ save_xmm9,0x50 # temporary for get/put bits operation
+.equ save_xmm0,0x60 # temporary for get/put bits operation
+.equ save_xmm11,0x70 # temporary for get/put bits operation
+.equ save_xmm12,0x80 # temporary for get/put bits operation
+.equ save_xmm13,0x90 # temporary for get/put bits operation
+.equ save_xmm14,0x0A0 # temporary for get/put bits operation
+.equ save_xmm15,0x0B0 # temporary for get/put bits operation
+
+.equ r,0x0C0 # pointer to r for remainder_piby2
+.equ rr,0x0D0 # pointer to r for remainder_piby2
+.equ region,0x0E0 # pointer to r for remainder_piby2
+
+.equ r1,0x0F0 # pointer to r for remainder_piby2
+.equ rr1,0x0100 # pointer to r for remainder_piby2
+.equ region1,0x0110 # pointer to r for remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # original x
+.equ p_sign,0x0180 # original x
+
+.equ p_original1,0x0190 # original x
+.equ p_mask1,0x01A0 # original x
+.equ p_sign1,0x01B0 # original x
+
+.equ save_r12,0x01C0 # temporary for get/put bits operation
+.equ save_r13,0x01D0 # temporary for get/put bits operation
+
+.equ save_xa,0x01E0 #qword
+.equ save_ya,0x01F0 #qword
+
+.equ save_nv,0x0200 #qword
+.equ p_iter,0x0210 # qword storage for number of loop iterations
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.globl vrsa_cosf
+ .type vrsa_cosf,@function
+vrsa_cosf:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *y
+
+
+ sub $0x0228,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+
+
+
+
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
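+# In C terms, the bookkeeping above is (informal sketch):
+#   iterations = n >> 2;                  /* four elements per main-loop pass */
+#   leftover   = n - (iterations << 2);   /* 0..3 values for the cleanup path */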
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrsa_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlps (%rsi),%xmm0
+ movhps 8(%rsi),%xmm0
+
+ prefetch 32(%rsi)
+ add $16,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# V4 START
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
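+
+# In C terms, the reduction above is a Cody-Waite style splitting of pi/2
+# into the non-overlapping parts piby2_1, piby2_2 and piby2_2tail
+# (a sketch following the comments; variable names are illustrative):
+#
+#	npi2  = (int)(x * twobypi + 0.5);
+#	rhead = x - npi2 * piby2_1;
+#	rtail = npi2 * piby2_2;
+#	t     = rhead;
+#	rhead = t - rtail;
+#	rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#	r     = rhead - rtail;		/* extra-precision remainder */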
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ xor %rax,%r10
+ xor %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
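+
+# A plausible reading of the sign logic above: for cos(x) the result is
+# negative exactly when (region ^ (region >> 1)) & 1 is set, i.e. for
+# regions 1 and 2 mod 4. That bit is computed per lane, shifted into the
+# sign-bit position of each double, and stored in p_sign/p_sign1 to be
+# XORed into the results in the cleanup code.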
+
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
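+
+# The table index built above packs the low (odd/even) bit of each of the
+# four regions into bits 0..3; a plausible C reading (idx is illustrative):
+#
+#	idx = (region0_lo & 1) | ((region0_hi & 1) << 1)
+#	    | ((region1_lo & 1) << 2) | ((region1_hi & 1) << 3);
+#	goto *evencos_oddsin_tbl[idx];	/* even region -> cos poly, odd -> sin */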
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# rax, rcx, r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Be sure not to use xmm1, xmm3 and xmm7
+# Use xmm5, xmm8, xmm0, xmm12
+#     xmm9, xmm11, xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0				# xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ subsd %xmm10,%xmm6 # rr=rhead-r
+ subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
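+
+# Note: ORing in 0x0008000000000000 sets the quiet-NaN bit of the stored
+# double, so a signalling NaN input comes back as a quiet NaN without
+# calling the reduction routine; region is forced to 0 (r10d is zero here),
+# which selects the even/cos path for that lane.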
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Restore xmm4 and xmm1, xmm3, xmm7
+# Can use xmm8, xmm0, xmm12
+#         xmm5, xmm9, xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm0	# xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0				# xmm0 = rtail = (npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	 %eax,region(%rsp)			# store lower region
+
+	subsd	%xmm0,%xmm6				# xmm6 = r = (rhead-rtail)
+
+	movlpd	 %xmm6,r(%rsp)				# store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# xmm1, xmm3, xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+ jmp .L__vrs4_cosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+# Do not use xmm1, xmm3, xmm7
+# Can use xmm9, xmm11, xmm13
+#         xmm5, xmm8, xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf; to avoid exceptions, use only scalar instructions on the upper arg, which has been moved to the lower portion of the fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+	subsd	%xmm10,%xmm7				# xmm7 = r = (rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_cosf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrs4_cosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+ jmp .L__vrs4_cosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+# rax, rcx, r8, r9
+# xmm10, xmm2, xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# r8, r9
+# xmm1, xmm3, xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+	subsd	%xmm10,%xmm7				# xmm7 = r = (rhead-rtail)
+
+	movlpd	 %xmm7,r1(%rsp)				# store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_cosf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_cosf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_cosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_cosf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ xor %rax,%r10
+ xor %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ leaq .Levencos_oddsin_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrsa_cosf_cleanup:
+
+ movapd p_sign(%rsp),%xmm10
+ movapd p_sign1(%rsp),%xmm1
+ xorpd %xmm4,%xmm10 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+ movlhps %xmm11,%xmm0
+
+# NEW
+
+.L__vrsa_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlps %xmm0,(%rdi)
+ movhps %xmm0,8(%rdi)
+
+ prefetch 32(%rdi)
+ add $16,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrsa_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrsa_cleanup
+
+.L__final_check:
+
+# NEW
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x0228,%rsp
+ ret
+
+#NEW
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when there are one to three cos calls left to make at the end.
+# save_xa and save_ya point at the next x and y array elements;
+# the number of values left is in save_nv
+
+.align 16
+.L__vrsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+
+# fill an __m128 with zeroes and the remaining values, then make a recursive call.
+ xorps %xmm0,%xmm0
+ movss %xmm0,p_temp+4(%rsp)
+ movlps %xmm0,p_temp+8(%rsp)
+
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vrsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p_temp+8(%rsp)
+
+.L__vrsacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &y parameter
+ call vrsa_cosf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+
+ mov p_temp2(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+
+.L__vrsacgf:
+ jmp .L__final_check
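+
+# In rough C terms the cleanup above does the following (xin/yout stand for
+# the p_temp/p_temp2 scratch slots and are illustrative names only):
+#
+#	float xin[4] = {0.0f}, yout[4];
+#	for (i = 0; i < leftover; i++) xin[i] = x[i];
+#	vrsa_cosf(4, xin, yout);	/* recurse on a zero-padded group of 4 */
+#	for (i = 0; i < leftover; i++) y[i] = yout[i];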
+
+#NEW
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm0,%xmm4 # + t
+ subpd %xmm11,%xmm5 # + t
+
+ jmp .L__vrsa_cosf_cleanup
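+
+# The core approximation evaluated above, in the notation of the comments
+# (coefficients c1..c4 live in .Lcosarray; a sketch, not an extra code path):
+#
+#	t      = 1.0 - 0.5*x2;
+#	zc     = (c1 + x2*c2) + x4*(c3 + x2*c4);
+#	cos(x) ~= t + x4*zc;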
+
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm10,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+	movhlps	%xmm5,%xmm9				# xmm5 = sin , xmm9 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ addsd %xmm10,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm13,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrsa_cosf_cleanup
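+
+# Per the comments above, after the movhlps split the low scalar lane of
+# each pair carries the sin series and the high lane the cos series
+# (a sketch in the notation of the comments):
+#
+#	sin(r) ~= r + r3*((s1 + x2*s2) + x4*(s3 + x2*s4));
+#	cos(r) ~= t + x4*((c1 + x2*c2) + x4*(c3 + x2*c4)),  t = 1 - 0.5*x2;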
+
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4:
+
+ movapd .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm3,%xmm7 # sincos term upper x2 for x3
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm1,%xmm7
+
+ mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ addsd %xmm10,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm6 # move x2 for x4
+ movapd %xmm3,%xmm7 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s4+x2s3)
+ mulpd %xmm11,%xmm5 # x4(s4+x2s3)
+
+ mulpd %xmm10,%xmm6 # get low x3 for sin term
+ mulpd %xmm1,%xmm7 # get low x3 for sin term
+ movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term
+ movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+	movhlps	%xmm4,%xmm12				# xmm12 = sin , xmm4 = cos
+	movhlps	%xmm5,%xmm13				# xmm13 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm12 # sin *x3
+ mulsd %xmm7,%xmm13 # sin *x3
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ movhlps %xmm10,%xmm0 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ addsd %xmm0,%xmm12 # sin + x
+ addsd %xmm11,%xmm13 # sin + x
+
+ subsd %xmm2,%xmm4 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm12,%xmm4
+ movlhps %xmm13,%xmm5
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos)
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm10,%xmm7
+
+ mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm3,%xmm12 # move high r for cos (cossin)
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos)
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin)
+
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm6,%xmm5 # sin *x3
+ mulsd %xmm7,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+
+ movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos)
+
+ subsd %xmm2,%xmm4 # cos-(-t)
+ subsd %xmm12,%xmm9 # cos-(-t)
+
+ addsd %xmm11,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrsa_cosf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: SIN
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr: COS
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; SIN
+ movapd %xmm3,%xmm11 # x2 ; COS
+ movapd %xmm3,%xmm1 # copy of x2 for x4
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm0 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm3,%xmm1 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm1,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm1,%xmm5 # x4 * zc
+
+ addpd %xmm10,%xmm4 # +x
+ subpd %xmm11,%xmm5 # +t
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr: COS
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr: SIN
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; COS
+ movapd %xmm3,%xmm11 # x2 ; SIN
+ movapd %xmm2,%xmm10 # copy of x2 for x4
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # s4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # s2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # s4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # s2*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # s1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm10,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zc
+
+ subpd %xmm0,%xmm4 # +t
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrsa_cosf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos
+ movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+ movapd %xmm12,%xmm2 # upper=x4
+ movsd %xmm6,%xmm2 # lower=x2
+ mulsd %xmm10,%xmm2 # lower=x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # upper= x4 * zc
+ # lower=x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+ movlhps %xmm7,%xmm10 #
+ addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4: #Derive from cossin_coscos
+ movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm10,%xmm2 # upper=x3 for sin
+ mulsd %xmm10,%xmm2 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # lower= x4 * zc
+ # upper= x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+ movsd %xmm7,%xmm10
+ addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+ movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd %xmm3,%xmm6 # lower x2 for x3 for sin
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1
+
+ movapd %xmm13,%xmm3 # upper=x4
+ movsd %xmm6,%xmm3 # lower x2
+ mulsd %xmm1,%xmm3 # lower x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # upper= x4 * zc
+ # lower=x3 * zs
+
+ movlhps %xmm7,%xmm1
+ addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm4 # -(-t)
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm3,%xmm0 # x2
+ movapd %xmm3,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ movapd %xmm13,%xmm3 # upper x4 for cos
+ movsd %xmm7,%xmm3 # lower x2 for sin
+ mulsd %xmm1,%xmm3 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +t upper, +x lower
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsincos_coscos_piby4:
+ movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm1,%xmm3 # upper=x3 for sin
+ mulsd %xmm1,%xmm3 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower= x4 * zc
+ # upper= x3 * zs
+
+ movsd %xmm7,%xmm1
+ subpd %xmm11,%xmm4 # -(-t)
+ addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos
+
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+
+ movsd %xmm3,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm1,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11 # upper =t ; lower =x
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm11,%xmm5 # +t lower, +x upper
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ movapd %xmm12,%xmm2 # upper x4 for cos
+ movsd %xmm7,%xmm2 # lower x2 for sin
+ mulsd %xmm10,%xmm2 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm10,%xmm4 # +t upper, +x lower
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movsd %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm10,%xmm2 # upper x3 for sin
+ mulsd %xmm10,%xmm2 # lower x4 for cos
+
+ movhlps %xmm10,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm11,%xmm4 # +t lower, +x upper
+
+ jmp .L__vrsa_cosf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = r2, xmm6 = rr
+# p_sign1 = Sign, xmm1  = r, xmm3 = r2, xmm7 = rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ #x2 = x * x;
+ #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+
+ #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4));
+
+
+ movapd %xmm2,%xmm0 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # x3
+
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrsa_cosf_cleanup
diff --git a/src/gas/vrsaexpf.S b/src/gas/vrsaexpf.S
new file mode 100644
index 0000000..399943e
--- /dev/null
+++ b/src/gas/vrsaexpf.S
@@ -0,0 +1,766 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsaexpf.s
+#
+# An array implementation of the expf libm function.
+#
+# Prototype:
+#
+# void vrsa_expf(int n, float *x, float *y);
+#
+# Computes e raised to the x power for an array of input values.
+# Places the results into the supplied y array.
+# This routine is implemented in single precision. It is slightly
+# less accurate than the double-precision version, but it is
+# better suited to vectorization.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+# The scheduling is done by trial and error. The resulting code represents
+# the best time of many variations. It would seem more interleaving could
+# be done, as there is a long stretch of the second computation that is not
+# interleaved. But moving any of this code forward makes the routine
+# slower.
+#
+#
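+# As a reading aid, here is a minimal scalar C sketch of the same
+# reduction/reconstruction (steps 1-3 below).  It is not part of this file;
+# the function name is illustrative, exp2f/ldexpf stand in for the 2^(j/32)
+# lookup table and the exponent-building code, and the real code splits
+# ln(2)/32 into head and tail constants for extra accuracy.
+#
+# #include <math.h>
+# float expf_sketch(float x)
+# {
+#     /* Step 1: n = nearest integer to x * 32/ln(2); split n into m and j */
+#     float c = logf(2.0f) / 32.0f;
+#     int   n = (int)lrintf(x / c);
+#     int   j = n & 0x1f;
+#     int   m = (n - j) / 32;
+#     float r = x - (float)n * c;          /* reduced argument             */
+#     /* Step 2: q = r + r^2/2 + r^3/6 + r^4/24                            */
+#     float q = r + r*r*0.5f + r*r*r*(1.0f/6.0f) + r*r*r*r*(1.0f/24.0f);
+#     /* Step 3: reconstitute, exp(x) = 2^m * 2^(j/32) * (1 + q)           */
+#     float f1 = exp2f((float)j / 32.0f);  /* stands in for the table      */
+#     return ldexpf(f1 + f1 * q, m);       /* scale by 2^m                 */
+# }
+#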
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_ux,0x00 #qword
+.equ p_ux2,0x010 #qword
+
+.equ save_xa,0x020 #qword
+.equ save_ya,0x028 #qword
+.equ save_nv,0x030 #qword
+
+
+.equ p_iter,0x038 # qword storage for number of loop iterations
+
+.equ p_j,0x040 # second temporary for get/put bits operation
+.equ p_m,0x050 #qword
+.equ p_j2,0x060 # second temporary for exponent multiply
+.equ p_m2,0x070 #qword
+.equ save_rbx,0x080 #qword
+
+
+.equ stack_size,0x098
+
+ .weak vrsa_expf_
+ .set vrsa_expf_,__vrsa_expf__
+ .weak vrsa_expf__
+ .set vrsa_expf__,__vrsa_expf__
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array expf
+#** VRSA_EXPF(N,X,Y)
+# C equivalent*/
+#void vrsa_expf__(int * n, float *x, float *y)
+#{
+# vrsa_expf(*n,x,y);
+#}
+.globl __vrsa_expf__
+ .type __vrsa_expf__,@function
+__vrsa_expf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+.globl vrsa_expf
+ .type vrsa_expf,@function
+vrsa_expf:
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp)
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm3 #
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ movups 16(%rsi),%xmm6
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+ movaps %xmm0,p_ux(%rsp)
+ maxps .L__real_m8192(%rip),%xmm0
+ movaps %xmm6,p_ux2(%rsp)
+ maxps .L__real_m8192(%rip),%xmm6
+
+
+# /* Find m, z1 and z2 such that exp(x) = 2**m * (z1 + z2) */
+# Step 1. Reduce the argument.
+ # r = x * thirtytwo_by_logbaseof2;
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm2 #
+
+ mulps %xmm0,%xmm2
+ xor %rax,%rax
+ minps .L__real_8192(%rip),%xmm2
+ movaps .L__real_thirtytwo_by_log2(%rip),%xmm5 #
+
+ mulps %xmm6,%xmm5
+ minps .L__real_8192(%rip),%xmm5 # protect against large input values
+
+
+# /* Set n = nearest integer to r */
+ cvtps2dq %xmm2,%xmm3
+ lea .L__two_to_jby32_table(%rip),%rdi
+ cvtdq2ps %xmm3,%xmm1
+
+ cvtps2dq %xmm5,%xmm8
+ cvtdq2ps %xmm8,%xmm7
+# r1 = x - n * logbaseof2_by_32_lead;
+ movaps .L__real_log2_by_32_head(%rip),%xmm2
+ mulps %xmm1,%xmm2
+ subps %xmm2,%xmm0 # r1 in xmm0,
+
+ movaps .L__real_log2_by_32_head(%rip),%xmm5
+ mulps %xmm7,%xmm5
+ subps %xmm5,%xmm6 # r1 in xmm6,
+
+
+# r2 = - n * logbaseof2_by_32_tail;
+ mulps .L__real_log2_by_32_tail(%rip),%xmm1
+ mulps .L__real_log2_by_32_tail(%rip),%xmm7
+
+# j = n & 0x0000001f;
+ movdqa %xmm3,%xmm4
+ movdqa .L__int_mask_1f(%rip),%xmm2
+ movdqa %xmm8,%xmm9
+ movdqa .L__int_mask_1f(%rip),%xmm5
+ pand %xmm4,%xmm2
+ movdqa %xmm2,p_j(%rsp)
+# f1 = two_to_jby32_lead_table[j];
+
+ pand %xmm9,%xmm5
+ movdqa %xmm5,p_j2(%rsp)
+
+# *m = (n - j) / 32;
+ psubd %xmm2,%xmm4
+ psrad $5,%xmm4
+ movdqa %xmm4,p_m(%rsp)
+ psubd %xmm5,%xmm9
+ psrad $5,%xmm9
+ movdqa %xmm9,p_m2(%rsp)
+
+ movaps %xmm0,%xmm3
+ addps %xmm1,%xmm3 # r = r1+ r2
+
+ mov p_j(%rsp),%eax # get an individual index
+ movaps %xmm6,%xmm8
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ addps %xmm7,%xmm8 # r = r1+ r2
+ mov %edx,p_j(%rsp) # save the f1 value
+
+# Step 2. Compute the polynomial.
+# q = r1 +
+# r*r*( 5.00000000000000008883e-01 +
+# r*( 1.66666666665260878863e-01 +
+# r*( 4.16666666662260795726e-02 +
+# r*( 8.33336798434219616221e-03 +
+# r*( 1.38889490863777199667e-03 )))));
+# q = r + r^2/2 + r^3/6 + r^4/24 + r^5/120 + r^6/720
+# q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ movaps %xmm3,%xmm4
+ movaps %xmm3,%xmm2
+ mulps %xmm2,%xmm2 # x*x
+ mulps .L__real_1_24(%rip),%xmm4 # /24
+
+ mov p_j+4(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+4(%rsp) # save the f1 value
+
+ addps .L__real_1_6(%rip),%xmm4 # +1/6
+
+ mulps %xmm2,%xmm3 # x^3
+ mov p_j+8(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+8(%rsp) # save the f1 value
+ mulps .L__real_half(%rip),%xmm2 # x^2/2
+ mov p_j+12(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j+12(%rsp) # save the f1 value
+ mulps %xmm3,%xmm4 # *x^3
+ mov p_j2(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2(%rsp) # save the f1 value
+
+ addps %xmm4,%xmm1 # +r2
+
+ addps %xmm2,%xmm1 # + x^2/2
+ addps %xmm1,%xmm0 # +r1
+
+ movaps %xmm8,%xmm9
+ mov p_j2+4(%rsp),%eax # get an individual index
+ movaps %xmm8,%xmm5
+ mulps %xmm5,%xmm5 # x*x
+ mulps .L__real_1_24(%rip),%xmm9 # /24
+
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+4(%rsp) # save the f1 value
+
+# deal with infinite or denormal results
+ movdqa p_m(%rsp),%xmm1
+ movdqa p_m(%rsp),%xmm2
+ pcmpgtd .L__int_127(%rip),%xmm2
+ pminsw .L__int_128(%rip),%xmm1 # ceil at 128
+ movmskps %xmm2,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm1 # add bias
+
+# *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j(%rsp),%xmm0 # * f1
+ addps p_j(%rsp),%xmm0 # + f1
+ jnz .L__exp_largef
+.L__check1:
+
+
+ pxor %xmm2,%xmm2 # floor at 0
+ pmaxsw %xmm2,%xmm1
+
+ pslld $23,%xmm1 # build 2^n
+
+ movaps %xmm1,%xmm2
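+
+# A scalar sketch of what the pminsw/paddd/pmaxsw/pslld sequence above
+# builds: clamp the biased exponent to [0,255] and move it into the
+# exponent field of a float.  The helper name is illustrative only.
+#
+# #include <stdint.h>
+# static float two_to_n_sketch(int n)
+# {
+#     int biased = n + 127;                    /* add bias                  */
+#     if (biased < 0)   biased = 0;            /* floor at 0 -> result 0.0  */
+#     if (biased > 255) biased = 255;          /* ceiling -> +inf pattern   */
+#     union { uint32_t u; float f; } v = { (uint32_t)biased << 23 };
+#     return v.f;                              /* 2^n (0 or inf at limits)  */
+# }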
+
+
+
+# check for infinity or nan
+ movaps p_ux(%rsp),%xmm1
+ andps .L__real_infinity(%rip),%xmm1
+ cmpps $0,.L__real_infinity(%rip),%xmm1
+ movmskps %xmm1,%ebx
+ test $0x0f,%ebx
+
+
+# end of splitexp
+# /* Scale (z1 + z2) by 2.0**m */
+# Step 3. Reconstitute.
+
+ mulps %xmm2,%xmm0 # result *= 2^n
+
+# we'd like to avoid a branch, and can use cmp's and and's to
+# eliminate them. But it adds cycles for normal cases
+# to handle events that are supposed to be exceptions.
+# Using this branch with the
+# check above results in faster code for the normal cases.
+# And branch mispredict penalties should only come into
+# play for nans and infinities.
+ jnz .L__exp_naninf
+.L__vsa_bottom1:
+
+ # q = r + r^2/2 + r^3/6 + r^4/24 good enough for single precision
+ addps .L__real_1_6(%rip),%xmm9 # +1/6
+
+ mulps %xmm5,%xmm8 # x^3
+ mov p_j2+8(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+8(%rsp) # save the f1 value
+ mulps .L__real_half(%rip),%xmm5 # x^2/2
+ mulps %xmm8,%xmm9 # *x^3
+
+ mov p_j2+12(%rsp),%eax # get an individual index
+ mov (%rdi,%rax,4),%edx # get the f1 value
+ mov %edx,p_j2+12(%rsp) # save the f1 value
+ addps %xmm9,%xmm7 # +r2
+
+ addps %xmm5,%xmm7 # + x^2/2
+ addps %xmm7,%xmm6 # +r1
+
+
+ # deal with infinite or denormal results
+ movdqa p_m2(%rsp),%xmm7
+ movdqa p_m2(%rsp),%xmm5
+ pcmpgtd .L__int_127(%rip),%xmm5
+ pminsw .L__int_128(%rip),%xmm7 # ceil at 128
+ movmskps %xmm5,%eax
+ test $0x0f,%eax
+
+ paddd .L__int_127(%rip),%xmm7 # add bias
+
+ # *z2 = f2 + ((f1 + f2) * q);
+ mulps p_j2(%rsp),%xmm6 # * f1
+ addps p_j2(%rsp),%xmm6 # + f1
+ jnz .L__exp_largef2
+.L__check2:
+
+ pxor %xmm5,%xmm5 # floor at 0
+ pmaxsw %xmm5,%xmm7
+
+ pslld $23,%xmm7 # build 2^n
+
+ movaps %xmm7,%xmm5
+
+
+ # check for infinity or nan
+ movaps p_ux2(%rsp),%xmm7
+ andps .L__real_infinity(%rip),%xmm7
+ cmpps $0,.L__real_infinity(%rip),%xmm7
+ movmskps %xmm7,%ebx
+ test $0x0f,%ebx
+
+
+ # end of splitexp
+ # /* Scale (z1 + z2) by 2.0**m */
+ # Step 3. Reconstitute.
+
+ mulps %xmm5,%xmm6 # result *= 2^n
+#__vsa_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movups %xmm0,(%rdi)
+
+ jnz .L__exp_naninf2
+
+.L__vsa_bottom2:
+
+ prefetch 64(%rdi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm6,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+# at least one of the numbers needs special treatment
+.L__exp_naninf:
+ lea p_ux(%rsp),%rcx
+ call .L__fexp_naninf
+ jmp .L__vsa_bottom1
+.L__exp_naninf2:
+ lea p_ux2(%rsp),%rcx
+ movaps %xmm6,%xmm0
+ call .L__fexp_naninf
+ movaps %xmm0,%xmm6
+ jmp .L__vsa_bottom2
+
+# deal with nans and infinities
+# This subroutine checks a packed single for nans and infinities and
+# produces the proper result from the exceptional inputs
+# Register assumptions:
+# Inputs:
+# rbx - mask of errors
+# xmm0 - computed result vector
+# Outputs:
+# xmm0 - new result vector
+# %rax, %rdx, %rbx, %xmm2 all modified.
+
+.L__fexp_naninf:
+ movaps %xmm0,p_j+8(%rsp) # save the computed values
+ test $1,%ebx # first value?
+ jz .L__Lni2
+ mov 0(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+8(%rsp) # copy the result
+.L__Lni2:
+ test $2,%ebx # second value?
+ jz .L__Lni3
+ mov 4(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+12(%rsp) # copy the result
+.L__Lni3:
+ test $4,%ebx # third value?
+ jz .L__Lni4
+ mov 8(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+16(%rsp) # copy the result
+.L__Lni4:
+ test $8,%ebx # fourth value?
+ jz .L__Lnie
+ mov 12(%rcx),%edx # get the input
+ call .L__naninf
+ mov %edx,p_j+20(%rsp) # copy the result
+.L__Lnie:
+ movaps p_j+8(%rsp),%xmm0 # get the answers
+ ret
+
+#
+# a simple subroutine to check a scalar input value for infinity
+# or NaN and return the correct result
+# expects the input in %edx, and returns the value in %edx. Destroys %eax.
+.L__naninf:
+ mov $0x0007FFFFF,%eax
+ test %eax,%edx
+ jnz .L__enan # jump if mantissa not zero, so it's a NaN
+# inf
+ mov %edx,%eax
+ rcl $1,%eax
+ jnc .L__r # exp(+inf) = inf
+ xor %edx,%edx # exp(-inf) = 0
+ jmp .L__r
+
+#NaN
+.L__enan:
+ mov $0x000400000,%eax # convert to quiet
+ or %eax,%edx
+.L__r:
+ ret
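+
+# A scalar sketch of the rule implemented by .L__naninf above, assuming the
+# input bit pattern has already been screened to be an infinity or a NaN
+# (the caller masks against the infinity exponent).  The helper name is
+# illustrative only.
+#
+# #include <stdint.h>
+# static uint32_t naninf_sketch(uint32_t ux)   /* ux = input bit pattern */
+# {
+#     if (ux & 0x007FFFFF)              /* non-zero mantissa: a NaN      */
+#         return ux | 0x00400000;       /*   return it quieted           */
+#     return (ux & 0x80000000) ? 0u     /* exp(-inf) = 0                 */
+#                              : ux;    /* exp(+inf) = +inf              */
+# }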
+
+
+ .align 16
+# we jump here when we have an odd number of exp calls to make at the
+# end
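+#
+# The tail handling below pads the 1..7 leftover inputs into a zeroed
+# 8-wide temporary, makes one recursive full-width call, and copies back
+# only the lanes that were real.  A C sketch with illustrative names:
+#
+# void vrsa_expf(int n, float *x, float *y);
+# static void expf_tail_sketch(int nleft, float *x, float *y)
+# {
+#     float tin[8] = {0.0f}, tout[8];
+#     for (int i = 0; i < nleft; i++) tin[i] = x[i];
+#     vrsa_expf(8, tin, tout);          /* recursive call on padded data */
+#     for (int i = 0; i < nleft; i++) y[i] = tout[i];
+# }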
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p_j(%rsp)
+ movaps %xmm0,p_j+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p_j(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p_j+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p_j+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p_j+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p_j+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p_j+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p_j+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p_j(%rsp),%rsi # &x parameter
+ lea p_j2(%rsp),%rdx # &y parameter
+	call	vrsa_expf@PLT	# call recursively to compute eight values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p_j2(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p_j2+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p_j2+8(%rsp),%ecx
+	mov	%ecx,8(%rdi)	# do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p_j2+12(%rsp),%ecx
+	mov	%ecx,12(%rdi)	# do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p_j2+16(%rsp),%ecx
+	mov	%ecx,16(%rdi)	# do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p_j2+20(%rsp),%ecx
+	mov	%ecx,20(%rdi)	# do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p_j2+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+ .align 16
+# deal with m > 127. In some instances, rounding during calculations
+# can result in infinity when it shouldn't. For these cases, we scale
+# m down, and scale the mantissa up.
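+#
+# A scalar sketch of that fix-up: halve the exponent's contribution and
+# double the mantissa term so the later scale by 2^m cannot overflow
+# prematurely.  The helper name is illustrative; the real code does this
+# per lane under the error mask.
+#
+# #include <math.h>
+# static float scale_large_m_sketch(float z, int m)
+# {
+#     if (m > 127) {       /* would saturate the biased exponent */
+#         m -= 1;          /* scale the exponent term down       */
+#         z *= 2.0f;       /* and the mantissa term up           */
+#     }
+#     return ldexpf(z, m);
+# }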
+
+.L__exp_largef:
+ movdqa %xmm0,p_j(%rsp) # save the mantissa portion
+ movdqa %xmm1,p_m(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf2
+ mov p_m(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m(%rsp) # save the exponent
+ movss p_j(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j(%rsp) # save the mantissa
+.L__Lf2:
+ test $2,%ecx # second value?
+ jz .L__Lf3
+ mov p_m+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+4(%rsp) # save the exponent
+ movss p_j+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+4(%rsp) # save the mantissa
+.L__Lf3:
+ test $4,%ecx # third value?
+ jz .L__Lf4
+ mov p_m+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+8(%rsp) # save the exponent
+ movss p_j+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+8(%rsp) # save the mantissa
+.L__Lf4:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe
+ mov p_m+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m+12(%rsp) # save the exponent
+ movss p_j+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+12(%rsp) # save the mantissa
+.L__Lfe:
+ movaps p_j(%rsp),%xmm0 # restore the mantissa portion back
+ movdqa p_m(%rsp),%xmm1 # restore the exponent portion
+ jmp .L__check1
+ .align 16
+
+.L__exp_largef2:
+ movdqa %xmm6,p_j(%rsp) # save the mantissa portion
+ movdqa %xmm7,p_m2(%rsp) # save the exponent portion
+ mov %eax,%ecx # save the error mask
+ test $1,%ecx # first value?
+ jz .L__Lf22
+ mov p_m2+0(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+0(%rsp) # save the exponent
+ movss p_j+0(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+0(%rsp) # save the mantissa
+.L__Lf22:
+ test $2,%ecx # second value?
+ jz .L__Lf32
+ mov p_m2+4(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+4(%rsp) # save the exponent
+ movss p_j+4(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+4(%rsp) # save the mantissa
+.L__Lf32:
+ test $4,%ecx # third value?
+ jz .L__Lf42
+ mov p_m2+8(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+8(%rsp) # save the exponent
+ movss p_j+8(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+8(%rsp) # save the mantissa
+.L__Lf42:
+ test $8,%ecx # fourth value?
+ jz .L__Lfe2
+ mov p_m2+12(%rsp),%edx # get the exponent
+ sub $1,%edx # scale it down
+ mov %edx,p_m2+12(%rsp) # save the exponent
+ movss p_j+12(%rsp),%xmm3 # get the mantissa
+ mulss .L__real_two(%rip),%xmm3 # scale it up
+ movss %xmm3,p_j+12(%rsp) # save the mantissa
+.L__Lfe2:
+ movaps p_j(%rsp),%xmm6 # restore the mantissa portion back
+ movdqa p_m2(%rsp),%xmm7 # restore the exponent portion
+ jmp .L__check2
+
+
+ .data
+ .align 64
+.L__real_half: .long 0x03f000000 # 1/2
+ .long 0x03f000000
+ .long 0x03f000000
+ .long 0x03f000000
+.L__real_two: .long 0x40000000 # 2
+ .long 0x40000000
+ .long 0x40000000
+ .long 0x40000000
+.L__real_8192: .long 0x46000000 # 8192, to protect against really large numbers
+ .long 0x46000000
+ .long 0x46000000
+ .long 0x46000000
+.L__real_m8192: .long 0xC6000000 # -8192, to protect against really small numbers
+ .long 0xC6000000
+ .long 0xC6000000
+ .long 0xC6000000
+.L__real_thirtytwo_by_log2: .long 0x04238AA3B # thirtytwo_by_log2
+ .long 0x04238AA3B
+ .long 0x04238AA3B
+ .long 0x04238AA3B
+.L__real_log2_by_32: .long 0x03CB17218 # log2_by_32
+ .long 0x03CB17218
+ .long 0x03CB17218
+ .long 0x03CB17218
+.L__real_log2_by_32_head: .long 0x03CB17000 # log2_by_32
+ .long 0x03CB17000
+ .long 0x03CB17000
+ .long 0x03CB17000
+.L__real_log2_by_32_tail: .long 0x0B585FDF4 # log2_by_32
+ .long 0x0B585FDF4
+ .long 0x0B585FDF4
+ .long 0x0B585FDF4
+.L__real_1_6: .long 0x03E2AAAAB # 0.16666666666 used in polynomial
+ .long 0x03E2AAAAB
+ .long 0x03E2AAAAB
+ .long 0x03E2AAAAB
+.L__real_1_24: .long 0x03D2AAAAB # 0.041666668 used in polynomial
+ .long 0x03D2AAAAB
+ .long 0x03D2AAAAB
+ .long 0x03D2AAAAB
+.L__real_1_120: .long 0x03C088889 # 0.0083333338 used in polynomial
+ .long 0x03C088889
+ .long 0x03C088889
+ .long 0x03C088889
+.L__real_infinity: .long 0x07f800000 # infinity
+ .long 0x07f800000
+ .long 0x07f800000
+ .long 0x07f800000
+.L__int_mask_1f: .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+ .long 0x00000001f
+.L__int_128: .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+ .long 0x000000080
+.L__int_127: .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+ .long 0x00000007f
+
+.L__two_to_jby32_table:
+ .long 0x03F800000 # 1.0000000000000000
+ .long 0x03F82CD87 # 1.0218971486541166
+ .long 0x03F85AAC3 # 1.0442737824274138
+ .long 0x03F88980F # 1.0671404006768237
+ .long 0x03F8B95C2 # 1.0905077326652577
+ .long 0x03F8EA43A # 1.1143867425958924
+ .long 0x03F91C3D3 # 1.1387886347566916
+ .long 0x03F94F4F0 # 1.1637248587775775
+ .long 0x03F9837F0 # 1.1892071150027210
+ .long 0x03F9B8D3A # 1.2152473599804690
+ .long 0x03F9EF532 # 1.2418578120734840
+ .long 0x03FA27043 # 1.2690509571917332
+ .long 0x03FA5FED7 # 1.2968395546510096
+ .long 0x03FA9A15B # 1.3252366431597413
+ .long 0x03FAD583F # 1.3542555469368927
+ .long 0x03FB123F6 # 1.3839098819638320
+ .long 0x03FB504F3 # 1.4142135623730951
+ .long 0x03FB8FBAF # 1.4451808069770467
+ .long 0x03FBD08A4 # 1.4768261459394993
+ .long 0x03FC12C4D # 1.5091644275934228
+ .long 0x03FC5672A # 1.5422108254079407
+ .long 0x03FC9B9BE # 1.5759808451078865
+ .long 0x03FCE248C # 1.6104903319492543
+ .long 0x03FD2A81E # 1.6457554781539649
+ .long 0x03FD744FD # 1.6817928305074290
+ .long 0x03FDBFBB8 # 1.7186192981224779
+ .long 0x03FE0CCDF # 1.7562521603732995
+ .long 0x03FE5B907 # 1.7947090750031072
+ .long 0x03FEAC0C7 # 1.8340080864093424
+ .long 0x03FEFE4BA # 1.8741676341103000
+ .long 0x03FF5257D # 1.9152065613971474
+ .long 0x03FFA83B3 # 1.9571441241754002
+ .long 0 # for alignment
+
+
+
diff --git a/src/gas/vrsalog10f.S b/src/gas/vrsalog10f.S
new file mode 100644
index 0000000..003eaf1
--- /dev/null
+++ b/src/gas/vrsalog10f.S
@@ -0,0 +1,1149 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsalog10f.s
+#
+# An array implementation of the log10f libm function.
+#
+# Prototype:
+#
+# void vrsa_log10f(int n, float *x, float *y);
+#
+# Computes the base-10 logarithm of x.
+# Places the results into the supplied y array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+#
+#
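+# A minimal scalar C sketch of the decomposition used below:
+# x = 2^xexp * f, f1 = (top mantissa bits)/128, u = 2*(f - f1)/(f + f1),
+# ln(x) = xexp*ln2 + ln(f1) + ln(f/f1), and finally
+# log10(x) = ln(x) * log10(e).  This is illustrative only: logf() stands
+# in for the lead/tail lookup tables, the polynomial keeps just its
+# leading term, and the real code splits ln2 and log10(e) into lead and
+# tail parts and handles zero/negative/inf/NaN inputs separately.
+#
+# #include <math.h>
+# float log10f_sketch(float x)      /* assumes x is positive and normal */
+# {
+#     int   xexp;
+#     float f  = 2.0f * frexpf(x, &xexp);      /* f in [1,2)             */
+#     xexp    -= 1;
+#     float f1 = (float)(int)(f * 128.0f + 0.5f) / 128.0f;
+#     float f2 = f - f1;
+#     float u  = f2 / (f1 + 0.5f * f2);        /* = 2*(f-f1)/(f+f1)      */
+#     float poly = u + u*u*u*(1.0f/12.0f);     /* ~ ln(f/f1)             */
+#     float lnx  = xexp * 0.69314718f + logf(f1) + poly;
+#     return lnx * 0.43429448f;                /* times log10(e)         */
+# }
+#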
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+ .weak vrsa_log10f_
+ .set vrsa_log10f_,__vrsa_log10f__
+ .weak vrsa_log10f__
+ .set vrsa_log10f__,__vrsa_log10f__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log10f
+#** VRSA_LOG10F(N,X,Y)
+# C equivalent*/
+#void vrsa_log10f__(int * n, float *x, float *y)
+#{
+# vrsa_log10f(*n,x,y);
+#}
+.globl __vrsa_log10f__
+ .type __vrsa_log10f__,@function
+__vrsa_log10f__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+
+
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ save_xa,0x0140 #qword
+.equ save_ya,0x0148 #qword
+.equ save_nv,0x0150 #qword
+.equ p_iter,0x0158 # qword storage for number of loop iterations
+
+.equ save_rbx,0x0160 #
+.equ save_rdi,0x0168 #qword
+.equ save_rsi,0x0170 #qword
+
+.equ p2_temp,0x0180 #qword
+.equ p2_temp1,0x01a0 #qword
+
+.equ stack_size,0x01c8
+
+
+
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+.globl vrsa_log10f
+ .type vrsa_log10f,@function
+vrsa_log10f:
+ sub $stack_size,%rsp
+	mov	%rbx,save_rbx(%rsp)	# save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ movups 16(%rsi),%xmm12
+# movhps .LQWORD,%xmm0 PTR [rsi+8]
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm12,p_x2(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+ movdqa %xmm12,%xmm9
+ movaps %xmm12,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ movaps %xmm0,%xmm2
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ movaps %xmm1,%xmm3
+
+# loge to log10
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_tail(%rip),%xmm0
+ mulps .L__real_log10e_lead(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm2
+ addps %xmm1,%xmm0
+ addps %xmm3,%xmm0
+ addps %xmm2,%xmm0
+# addps %xmm1,%xmm0
+
+
+
+# check for e
+# test $0x0f,%r9d
+ # jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 < x ?  (also catches NaNs, which compare false)
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movups %xmm0,(%rdi)
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps p_z12(%rsp),%xmm1 # z1 values
+
+ mulps %xmm13,%xmm8
+ addps %xmm8,%xmm1 #r1
+ movaps %xmm1,%xmm8
+ mulps .L__real_log2_tail(%rip),%xmm13
+ addps %xmm13,%xmm7 #r2
+ movaps %xmm7,%xmm9
+	# loge to log10
+ mulps .L__real_log10e_tail(%rip),%xmm7
+ mulps .L__real_log10e_tail(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm9
+ mulps .L__real_log10e_lead(%rip),%xmm8
+ addps %xmm7,%xmm1
+ addps %xmm9,%xmm1
+ addps %xmm8,%xmm1
+# addps %xmm7,%xmm1
+
+ # check e as a special case
+# movaps p_x2(%rsp),%xmm10
+# cmpps $0,.L__real_ef(%rip),%xmm10
+# movmskps %xmm10,%r9d
+ # check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7	# 0 < x ?  (also catches NaNs, which compare false)
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+ prefetch 64(%rsi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm1,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+
+
+ .align 16
+# we jump here when we have an odd number of log calls to make at the
+# end
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p2_temp+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p2_temp+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p2_temp+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+	call	vrsa_log10f@PLT	# call recursively to compute eight values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+	mov	%ecx,8(%rdi)	# do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+12(%rsp),%ecx
+	mov	%ecx,12(%rdi)	# do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+16(%rsp),%ecx
+	mov	%ecx,16(%rdi)	# do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+20(%rsp),%ecx
+	mov	%ecx,20(%rdi)	# do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+# loge to log10
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm1
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ #loge to log10
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm7
+
+ mulps .L__real_log10e_tail(%rip),%xmm2
+ mulps .L__real_log10e_tail(%rip),%xmm3
+ mulps .L__real_log10e_lead(%rip),%xmm7
+ mulps .L__real_log10e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm7,%xmm3
+ addps %xmm5,%xmm3
+ # return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1. NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+
+.L__real_zero:			.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 			.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:			.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+.L__real_log10e_lead: .quad 0x03EDE00003EDE0000 # log10e_lead 0.4335937500
+ .quad 0x03EDE00003EDE0000
+.L__real_log10e_tail: .quad 0x03A37B1523A37B152 # log10e_tail 0.0007007319
+ .quad 0x03A37B1523A37B152
+
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrsalog2f.S b/src/gas/vrsalog2f.S
new file mode 100644
index 0000000..9760d9f
--- /dev/null
+++ b/src/gas/vrsalog2f.S
@@ -0,0 +1,1140 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsalog2f.s
+#
+# An array implementation of the log2f libm function.
+#
+# Prototype:
+#
+# void vrsa_log2f(int n, float *x, float *y);
+#
+# Computes the base-2 logarithm of x.
+# Places the results into the supplied y array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrsa_log2f_
+ .set vrsa_log2f_,__vrsa_log2f__
+ .weak vrsa_log2f__
+ .set vrsa_log2f__,__vrsa_log2f__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array log2f
+#** VRSA_LOG2F(N,X,Y)
+# C equivalent*/
+#void vrsa_log2f__(int * n, float *x, float *y)
+#{
+# vrsa_log2f(*n,x,y);
+#}
+.globl __vrsa_log2f__
+ .type __vrsa_log2f__,@function
+__vrsa_log2f__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+
+
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ save_xa,0x0140 #qword
+.equ save_ya,0x0148 #qword
+.equ save_nv,0x0150 #qword
+.equ p_iter,0x0158 # qword storage for number of loop iterations
+
+.equ save_rbx,0x0160 #
+.equ save_rdi,0x0168 #qword
+.equ save_rsi,0x0170 #qword
+
+.equ p2_temp,0x0180 #qword
+.equ p2_temp1,0x01a0 #qword
+
+.equ stack_size,0x01c8
+
+
+
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - double *x
+# rdx - double *y
+
+.globl vrsa_log2f
+ .type vrsa_log2f,@function
+vrsa_log2f:
+ sub $stack_size,%rsp
+	mov	%rbx,save_rbx(%rsp)	# save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ movups 16(%rsi),%xmm12
+# movhps .LQWORD,%xmm0 PTR [rsi+8]
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm12,p_x2(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+ movdqa %xmm12,%xmm9
+ movaps %xmm12,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
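+
+# (At this point xmm6/xmm13 hold xexp, the unbiased exponent of each input,
+#  and the packed 7-bit indexes spilled to p_idx/p_idx2 select f1 = index/128,
+#  i.e. the reduced mantissa f rounded to the nearest 1/128 step; the scalar
+#  code below uses these indexes for the ln lead/tail table lookups.)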
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
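+
+# (u = f2 / (f1 + 0.5*f2) = 2*(f - f1)/(f + f1), the symmetric reduction
+#  variable used by the ln(f/f1) series further down.)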
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2e_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm3
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
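+
+# (poly = u + cb1*u^3 + cb2*u^5 + cb3*u^7 with cb1..cb3 close to 1/12, 1/80
+#  and 1/448, i.e. the series for ln((1 + u/2)/(1 - u/2)) = ln(f/f1).)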
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+ movaps %xmm1,%xmm4 #z2 copy
+ movaps p_z1(%rsp),%xmm0 # z1 values
+ movaps %xmm0,%xmm5 #z1 copy
+
+ mulps %xmm2,%xmm5 #z1*log2e_lead
+ mulps %xmm2,%xmm1 #z2*log2e_lead
+ mulps %xmm3,%xmm4 #z2*log2e_tail
+ mulps %xmm3,%xmm0 #z1*log2e_tail
+ addps %xmm6,%xmm5 #r1 = z1*log2e_lead + xexp
+ addps %xmm4,%xmm0 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm1,%xmm0 #r2
+#return r1+r2
+ addps %xmm5,%xmm0 # r1+ r2
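+
+# (Final combine for log2: log2(x) = xexp + (z1 + z2)*log2(e), where z1 is
+#  the lead-table ln value and z2 = poly + tail-table correction.  log2(e)
+#  is kept split as lead + tail so the large terms (r1) and the small
+#  corrections (r2) can be accumulated separately before the last add.)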
+
+
+# check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 >= x ?  (catches NaNs too)
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+### if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movups %xmm0,(%rdi)
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2e_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ movaps .L__real_log2e_tail(%rip),%xmm9
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps %xmm7,%xmm10 #z2 copy
+ movaps p_z12(%rsp),%xmm1 # z1 values
+ movaps %xmm1,%xmm11 #z1 copy
+
+ mulps %xmm8,%xmm11 #z1*log2e_lead
+ mulps %xmm8,%xmm7 #z2*log2e_lead
+ mulps %xmm9,%xmm10 #z2*log2e_tail
+ mulps %xmm9,%xmm1 #z1*log2e_tail
+ addps %xmm13,%xmm11 #r1 = z1*log2e_lead + xexp
+ addps %xmm10,%xmm1 #z1*log2e_tail + z2*log2e_tail
+ addps %xmm7,%xmm1 #r2
+ #return r1+r2
+ addps %xmm11,%xmm1 # r1+ r2
+
+ # check e as a special case
+# movaps p_x2(%rsp),%xmm10
+# cmpps $0,.L__real_ef(%rip),%xmm10
+# movmskps %xmm10,%r9d
+ # check for e
+# test $0x0f,%r9d
+# jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7  # 0 >= x ?  (catches NaNs too)
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ### if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+ prefetch 64(%rsi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm1,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+
+
+ .align 16
+# we jump here when we have an odd number of log calls to make at the
+# end
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p2_temp+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p2_temp+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p2_temp+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+	call	vrsa_log2f@PLT	# call recursively to compute eight values
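+# (p2_temp was zero-filled above, so the unused lanes of this eight-wide
+#  call just compute log2f(0); those extra results are never copied back
+#  out below, so the zero padding is harmless.)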
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+	mov	 %ecx,8(%rdi)			# do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+12(%rsp),%ecx
+	mov	 %ecx,12(%rdi)			# do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+16(%rsp),%ecx
+	mov	 %ecx,16(%rdi)			# do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+20(%rsp),%ecx
+	mov	 %ecx,20(%rdi)			# do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+
+# loge to log2
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm1
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm1
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm1,%xmm3
+ addps %xmm5,%xmm3
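+
+# (This converts the near-1 natural-log result to log2: r is split into a
+#  truncated head (lower 16 mantissa bits cleared) plus a remainder folded
+#  into r2, and each piece is multiplied by the lead/tail parts of log2(e),
+#  summing the small products first to preserve precision.)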
+
+# return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# loge to log2
+ movaps %xmm3,%xmm5 #r1=r
+ pand .L__mask_lower(%rip),%xmm5
+ subps %xmm5,%xmm3
+ addps %xmm3,%xmm2 #r2 = r2 + (r-r1)
+
+ movaps %xmm5,%xmm3
+ movaps %xmm2,%xmm7
+
+ mulps .L__real_log2e_tail(%rip),%xmm2
+ mulps .L__real_log2e_tail(%rip),%xmm3
+ mulps .L__real_log2e_lead(%rip),%xmm7
+ mulps .L__real_log2e_lead(%rip),%xmm5
+ addps %xmm2,%xmm3
+ addps %xmm7,%xmm3
+ addps %xmm5,%xmm3
+
+ # return r + r2;
+# addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+.L__real_log2e_lead: .quad 0x03FB800003FB80000 #1.4375000000
+ .quad 0x03FB800003FB80000
+.L__real_log2e_tail: .quad 0x03BAA3B293BAA3B29 # 0.0051950408889633
+ .quad 0x03BAA3B293BAA3B29
+
+.L__mask_lower: .quad 0x0ffff0000ffff0000 #
+ .quad 0x0ffff0000ffff0000
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrsalogf.S b/src/gas/vrsalogf.S
new file mode 100644
index 0000000..1f96523
--- /dev/null
+++ b/src/gas/vrsalogf.S
@@ -0,0 +1,1088 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsalogf.s
+#
+# An array implementation of the logf libm function.
+#
+# Prototype:
+#
+# void vrsa_logf(int n, float *x, float *y);
+#
+# Computes the natural log of x.
+# Places the results into the supplied y array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
+
+# This array version is basically an unrolling of the by4 scalar single
+# routine. The second set of operations is performed by the indented
+# instructions interleaved into the first set.
+#
+#
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+ .weak vrsa_logf_
+ .set vrsa_logf_,__vrsa_logf__
+ .weak vrsa_logf__
+ .set vrsa_logf__,__vrsa_logf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array logf
+#** VRSA_LOGF(N,X,Y)
+# C equivalent*/
+#void vrsa_logf__(int * n, float *x, float *y)
+#{
+# vrsa_logf(*n,x,y);
+#}
+.globl __vrsa_logf__
+ .type __vrsa_logf__,@function
+__vrsa_logf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+
+# define local variable storage offsets
+.equ p_x,0 # save x
+.equ p_idx,0x010 # xmmword index
+.equ p_z1,0x020 # xmmword index
+.equ p_q,0x030 # xmmword index
+.equ p_corr,0x040 # xmmword index
+.equ p_omask,0x050 # xmmword index
+
+
+.equ p_x2,0x0100 # save x
+.equ p_idx2,0x0110 # xmmword index
+.equ p_z12,0x0120 # xmmword index
+.equ p_q2,0x0130 # xmmword index
+
+.equ save_xa,0x0140 #qword
+.equ save_ya,0x0148 #qword
+.equ save_nv,0x0150 #qword
+.equ p_iter,0x0158 # qword storage for number of loop iterations
+
+.equ save_rbx,0x0160 #
+.equ save_rdi,0x0168 #qword
+.equ save_rsi,0x0170 #qword
+
+.equ p2_temp,0x0180 #qword
+.equ p2_temp1,0x01a0 #qword
+
+.equ stack_size,0x01c8
+
+
+
+
+# parameters are passed in by gcc as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *y
+
+.globl vrsa_logf
+ .type vrsa_logf,@function
+vrsa_logf:
+ sub $stack_size,%rsp
+	mov		%rbx,save_rbx(%rsp)	# save rbx
+
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $3,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $3,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+# In this second version, process the array 8 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ movups 16(%rsi),%xmm12
+# movhps .LQWORD,%xmm0 PTR [rsi+8]
+ prefetch 64(%rsi)
+ add $32,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+# check e as a special case
+ movdqa %xmm0,p_x(%rsp) # save x
+ movdqa %xmm12,p_x2(%rsp) # save x
+ movdqa %xmm0,%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movmskps %xmm2,%r9d
+
+ movdqa %xmm12,%xmm9
+ movaps %xmm12,%xmm7
+
+#
+# compute the index into the log tables
+#
+ movdqa %xmm0,%xmm3
+ movaps %xmm0,%xmm1
+ psrld $23,%xmm3
+
+ #
+ # compute the index into the log tables
+ #
+ psrld $23,%xmm9
+ subps .L__real_one(%rip),%xmm7
+ psubd .L__mask_127(%rip),%xmm9
+ subps .L__real_one(%rip),%xmm1
+ psubd .L__mask_127(%rip),%xmm3
+ cvtdq2ps %xmm9,%xmm13 # xexp
+
+ movdqa %xmm12,%xmm9
+ pand .L__real_mant(%rip),%xmm9
+ xor %r8,%r8
+ movdqa %xmm9,%xmm8
+ movaps .L__real_half(%rip),%xmm11 # .5
+ cvtdq2ps %xmm3,%xmm6 # xexp
+
+ movdqa %xmm0,%xmm3
+ pand .L__real_mant(%rip),%xmm3
+ xor %r8,%r8
+ movdqa %xmm3,%xmm2
+ movaps .L__real_half(%rip),%xmm5 # .5
+
+#/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ psrld $16,%xmm3
+ lea .L__np_ln_lead_table(%rip),%rdx
+ movdqa %xmm3,%xmm4
+ psrld $16,%xmm9
+ movdqa %xmm9,%xmm10
+ psrld $1,%xmm9
+ psrld $1,%xmm3
+ paddd .L__mask_040(%rip),%xmm3
+ pand .L__mask_001(%rip),%xmm4
+ paddd %xmm4,%xmm3
+ cvtdq2ps %xmm3,%xmm1
+ #/* Now x = 2**xexp * f, 1/2 <= f < 1. */
+ paddd .L__mask_040(%rip),%xmm9
+ pand .L__mask_001(%rip),%xmm10
+ paddd %xmm10,%xmm9
+ cvtdq2ps %xmm9,%xmm7
+ packssdw %xmm3,%xmm3
+ movq %xmm3,p_idx(%rsp)
+ packssdw %xmm9,%xmm9
+ movq %xmm9,p_idx2(%rsp)
+
+
+# reduce and get u
+ movdqa %xmm0,%xmm3
+ orps .L__real_half(%rip),%xmm2
+
+
+ mulps .L__real_3c000000(%rip),%xmm1 # f1 = index/128
+ # reduce and get u
+
+
+ subps %xmm1,%xmm2 # f2 = f - f1
+ mulps %xmm2,%xmm5
+ addps %xmm5,%xmm1
+
+ divps %xmm1,%xmm2 # u
+
+ movdqa %xmm12,%xmm9
+ orps .L__real_half(%rip),%xmm8
+
+
+ mulps .L__real_3c000000(%rip),%xmm7 # f1 = index/128
+ subps %xmm7,%xmm8 # f2 = f - f1
+ mulps %xmm8,%xmm11
+ addps %xmm11,%xmm7
+
+
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z1+8(%rsp) # save the f1 value
+
+ divps %xmm7,%xmm8 # u
+ lea .L__np_ln_lead_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12(%rsp) # save the f1 values
+
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f1 value
+
+ mov %cx,%r8w
+ ror $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f1 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_z12+8(%rsp) # save the f1 value
+
+# solve for ln(1+u)
+ movaps %xmm2,%xmm1 # u
+ mulps %xmm2,%xmm2 # u^2
+ movaps %xmm2,%xmm5
+ movaps .L__real_cb3(%rip),%xmm3
+ mulps %xmm2,%xmm3 #Cu2
+ mulps %xmm1,%xmm5 # u^3
+ addps .L__real_cb2(%rip),%xmm3 #B+Cu2
+ movaps %xmm2,%xmm4
+ mulps %xmm5,%xmm4 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm2
+
+ mulps .L__real_cb1(%rip),%xmm5 #Au3
+ addps %xmm5,%xmm1 # u+Au3
+ mulps %xmm3,%xmm4 # u5(B+Cu2)
+
+ lea .L__np_ln_tail_table(%rip),%rdx
+ addps %xmm4,%xmm1 # poly
+
+# recombine
+ mov p_idx(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q+8(%rsp) # save the f2 value
+
+ addps p_q(%rsp),%xmm1 #z2 +=q
+
+ movaps p_z1(%rsp),%xmm0 # z1 values
+
+ mulps %xmm6,%xmm2
+ addps %xmm2,%xmm0 #r1
+ mulps .L__real_log2_tail(%rip),%xmm6
+ addps %xmm6,%xmm1 #r2
+ addps %xmm1,%xmm0
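+
+# (Natural-log combine: result = (z1 + xexp*log2_lead) + (z2 + xexp*log2_tail),
+#  where log2_lead + log2_tail = ln(2), split into a short lead part and a
+#  tail correction for extra precision.)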
+
+
+
+# check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e
+.L__f1:
+
+# check for negative numbers or zero
+ xorps %xmm1,%xmm1
+	cmpps	$1,p_x(%rsp),%xmm1	# 0 >= x ?  (catches NaNs too)
+ movmskps %xmm1,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg
+
+.L__f2:
+## if +inf
+ movaps p_x(%rsp),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf
+.L__f3:
+
+ movaps p_x(%rsp),%xmm3
+ subps .L__real_one(%rip),%xmm3
+ andps .L__real_notsign(%rip),%xmm3
+ cmpps $2,.L__real_threshold(%rip),%xmm3
+ movmskps %xmm3,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one
+.L__f4:
+
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movups %xmm0,(%rdi)
+
+# finish the second set of calculations
+
+ # solve for ln(1+u)
+ movaps %xmm8,%xmm7 # u
+ mulps %xmm8,%xmm8 # u^2
+ movaps %xmm8,%xmm11
+
+ movaps .L__real_cb3(%rip),%xmm9
+ mulps %xmm8,%xmm9 #Cu2
+ mulps %xmm7,%xmm11 # u^3
+ addps .L__real_cb2(%rip),%xmm9 #B+Cu2
+ movaps %xmm8,%xmm10
+ mulps %xmm11,%xmm10 # u^5
+ movaps .L__real_log2_lead(%rip),%xmm8
+
+ mulps .L__real_cb1(%rip),%xmm11 #Au3
+ addps %xmm11,%xmm7 # u+Au3
+ mulps %xmm9,%xmm10 # u5(B+Cu2)
+ addps %xmm10,%xmm7 # poly
+
+
+ # recombine
+ lea .L__np_ln_tail_table(%rip),%rdx
+ mov p_idx2(%rsp),%rcx # get the indexes
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ or -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2(%rsp) # save the f2 value
+
+
+ mov %cx,%r8w
+ shr $16,%rcx
+ mov -256(%rdx,%r8,4),%eax # get the f2 value
+
+ mov %cx,%r8w
+ mov -256(%rdx,%r8,4),%ebx # get the f2 value
+ shl $32,%rbx
+ or %rbx,%rax
+ mov %rax,p_q2+8(%rsp) # save the f2 value
+
+ addps p_q2(%rsp),%xmm7 #z2 +=q
+ movaps p_z12(%rsp),%xmm1 # z1 values
+
+ mulps %xmm13,%xmm8
+ addps %xmm8,%xmm1 #r1
+ mulps .L__real_log2_tail(%rip),%xmm13
+ addps %xmm13,%xmm7 #r2
+ addps %xmm7,%xmm1
+
+ # check e as a special case
+ movaps p_x2(%rsp),%xmm10
+ cmpps $0,.L__real_ef(%rip),%xmm10
+ movmskps %xmm10,%r9d
+ # check for e
+ test $0x0f,%r9d
+ jnz .L__vlogf_e2
+.L__f12:
+
+ # check for negative numbers or zero
+ xorps %xmm7,%xmm7
+	cmpps	$1,p_x2(%rsp),%xmm7  # 0 >= x ?  (catches NaNs too)
+ movmskps %xmm7,%r9d
+ cmp $0x0f,%r9d
+ jnz .L__z_or_neg2
+
+.L__f22:
+ ## if +inf
+ movaps p_x2(%rsp),%xmm9
+ cmpps $0,.L__real_inf(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__log_inf2
+.L__f32:
+
+ movaps p_x2(%rsp),%xmm9
+ subps .L__real_one(%rip),%xmm9
+ andps .L__real_notsign(%rip),%xmm9
+ cmpps $2,.L__real_threshold(%rip),%xmm9
+ movmskps %xmm9,%r9d
+ test $0x0f,%r9d
+ jnz .L__near_one2
+.L__f42:
+
+
+ prefetch 64(%rsi)
+ add $32,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+# store the result _m128d
+ movups %xmm1,-16(%rdi)
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+
+#
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+
+
+ .align 16
+# we jump here when we have an odd number of log calls to make at the
+# end
+.L__vsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+# fill in a m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ cmp $4,%rax
+ jl .L__vsacg
+
+ mov 12(%rsi),%ecx # do the fourth value
+ mov %ecx,p2_temp+12(%rsp)
+ cmp $5,%rax
+ jl .L__vsacg
+
+ mov 16(%rsi),%ecx # do the fifth value
+ mov %ecx,p2_temp+16(%rsp)
+ cmp $6,%rax
+ jl .L__vsacg
+
+ mov 20(%rsi),%ecx # do the sixth value
+ mov %ecx,p2_temp+20(%rsp)
+ cmp $7,%rax
+ jl .L__vsacg
+
+ mov 24(%rsi),%ecx # do the last value
+ mov %ecx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $8,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp1(%rsp),%rdx # &y parameter
+	call	vrsa_logf@PLT	# call recursively to compute eight values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+	mov	 %ecx,8(%rdi)			# do the third value
+ cmp $4,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+12(%rsp),%ecx
+	mov	 %ecx,12(%rdi)			# do the fourth value
+ cmp $5,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+16(%rsp),%ecx
+	mov	 %ecx,16(%rdi)			# do the fifth value
+ cmp $6,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+20(%rsp),%ecx
+	mov	 %ecx,20(%rdi)			# do the sixth value
+ cmp $7,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+24(%rsp),%ecx
+ mov %ecx,24(%rdi) # do the last value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+.L__vlogf_e:
+ movdqa p_x(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ jmp .L__f1
+
+.L__vlogf_e2:
+ movdqa p_x2(%rsp),%xmm2
+ cmpps $0,.L__real_ef(%rip),%xmm2
+ movdqa %xmm2,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-e values
+ andps .L__real_one(%rip),%xmm2 # setup the 1 values
+ orps %xmm3,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ jmp .L__f12
+
+ .align 16
+.L__near_one:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm3,p_omask(%rsp) # save ones mask
+ movaps p_x(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+# u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm1
+ divps %xmm2,%xmm1 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+# correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm1,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+# u = u + u;
+ addps %xmm1,%xmm1 #u
+ movaps %xmm1,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+# r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm1,%xmm5 # Cu
+ movaps %xmm1,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm1
+ mulps %xmm1,%xmm1 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm1 #u6(Cu+Du3)
+ addps %xmm1,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+# return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm0,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+
+ jmp .L__f4
+
+
+ .align 16
+.L__near_one2:
+# saves 10 cycles
+# r = x - 1.0;
+ movdqa %xmm9,p_omask(%rsp) # save ones mask
+ movaps p_x2(%rsp),%xmm3
+ movaps .L__real_two(%rip),%xmm2
+ subps .L__real_one(%rip),%xmm3 # r
+ # u = r / (2.0 + r);
+ addps %xmm3,%xmm2
+ movaps %xmm3,%xmm7
+ divps %xmm2,%xmm7 # u
+ movaps .L__real_ca4(%rip),%xmm4 #D
+ movaps .L__real_ca3(%rip),%xmm5 #C
+ # correction = r * u;
+ movaps %xmm3,%xmm6
+ mulps %xmm7,%xmm6 # correction
+ movdqa %xmm6,p_corr(%rsp) # save correction
+ # u = u + u;
+ addps %xmm7,%xmm7 #u
+ movaps %xmm7,%xmm2
+ mulps %xmm2,%xmm2 #v =u^2
+ # r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ mulps %xmm7,%xmm5 # Cu
+ movaps %xmm7,%xmm6
+ mulps %xmm2,%xmm6 # u^3
+ mulps .L__real_ca2(%rip),%xmm2 #Bu^2
+ mulps %xmm6,%xmm4 #Du^3
+
+ addps .L__real_ca1(%rip),%xmm2 # +A
+ movaps %xmm6,%xmm7
+ mulps %xmm7,%xmm7 # u^6
+ addps %xmm4,%xmm5 #Cu+Du3
+
+ mulps %xmm6,%xmm2 #u3(A+Bu2)
+ mulps %xmm5,%xmm7 #u6(Cu+Du3)
+ addps %xmm7,%xmm2
+ subps p_corr(%rsp),%xmm2 # -correction
+
+ # return r + r2;
+ addps %xmm2,%xmm3
+
+ movdqa p_omask(%rsp),%xmm6
+ movdqa %xmm6,%xmm2
+ andnps %xmm1,%xmm6 # keep the non-nearone values
+ andps %xmm3,%xmm2 # setup the nearone values
+ orps %xmm6,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+
+ jmp .L__f42
+
+# we have a zero, a negative number, or both.
+# the mask is already in %xmm1.  NaNs are also picked up here, along with -inf.
+.L__z_or_neg:
+# deal with negatives first
+ movdqa %xmm1,%xmm3
+ andps %xmm0,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm1 # setup the nan values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+# check for +/- 0
+ xorps %xmm1,%xmm1
+ cmpps $0,p_x(%rsp),%xmm1 # 0 ?.
+ movmskps %xmm1,%r9d
+ test $0x0f,%r9d
+ jz .L__zn2
+
+ movdqa %xmm1,%xmm3
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm1 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+
+.L__zn2:
+# check for NaNs
+ movaps p_x(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x(%rsp),%xmm1 # isolate the NaNs
+ pand %xmm4,%xmm1
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm1,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm1
+ andnps %xmm0,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm0 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f2
+
+# handle only +inf log(+inf) = inf
+.L__log_inf:
+ movdqa %xmm3,%xmm1
+ andnps %xmm0,%xmm3 # keep the non-error values
+ andps p_x(%rsp),%xmm1 # setup the +inf values
+ orps %xmm3,%xmm1 # merge
+ movdqa %xmm1,%xmm0 # and replace
+ jmp .L__f3
+
+
+.L__z_or_neg2:
+ # deal with negatives first
+ movdqa %xmm7,%xmm3
+ andps %xmm1,%xmm3 # keep the non-error values
+ andnps .L__real_nan(%rip),%xmm7 # setup the nan values
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ # check for +/- 0
+ xorps %xmm7,%xmm7
+ cmpps $0,p_x2(%rsp),%xmm7 # 0 ?.
+ movmskps %xmm7,%r9d
+ test $0x0f,%r9d
+ jz .L__zn22
+
+ movdqa %xmm7,%xmm3
+ andnps %xmm1,%xmm3 # keep the non-error values
+ andps .L__real_ninf(%rip),%xmm7 # ; C99 specs -inf for +-0
+ orps %xmm3,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+
+.L__zn22:
+ # check for NaNs
+ movaps p_x2(%rsp),%xmm3
+ andps .L__real_inf(%rip),%xmm3
+ cmpps $0,.L__real_inf(%rip),%xmm3 # mask for max exponent
+
+ movdqa p_x2(%rsp),%xmm4
+ pand .L__real_mant(%rip),%xmm4 # mask for non-zero mantissa
+ pcmpeqd .L__real_zero(%rip),%xmm4
+ pandn %xmm3,%xmm4 # mask for NaNs
+ movdqa %xmm4,%xmm2
+ movdqa p_x2(%rsp),%xmm7 # isolate the NaNs
+ pand %xmm4,%xmm7
+
+ pand .L__real_qnanbit(%rip),%xmm4 # now we have a mask that will set QNaN bit
+ por %xmm7,%xmm4 # turn SNaNs to QNaNs
+
+ movdqa %xmm2,%xmm7
+ andnps %xmm1,%xmm2 # keep the non-error values
+ orps %xmm4,%xmm2 # merge
+ movdqa %xmm2,%xmm1 # and replace
+ xorps %xmm4,%xmm4
+
+ jmp .L__f22
+
+ # handle only +inf log(+inf) = inf
+.L__log_inf2:
+ movdqa %xmm9,%xmm7
+ andnps %xmm1,%xmm9 # keep the non-error values
+ andps p_x2(%rsp),%xmm7 # setup the +inf values
+ orps %xmm9,%xmm7 # merge
+ movdqa %xmm7,%xmm1 # and replace
+ jmp .L__f32
+
+
+ .data
+ .align 64
+
+
+.L__real_zero:				.quad 0x00000000000000000	# 0.0
+ .quad 0x00000000000000000
+.L__real_one: .quad 0x03f8000003f800000 # 1.0
+ .quad 0x03f8000003f800000
+.L__real_two: 				.quad 0x04000000040000000	# 2.0
+ .quad 0x04000000040000000
+.L__real_ninf: .quad 0x0ff800000ff800000 # -inf
+ .quad 0x0ff800000ff800000
+.L__real_inf: .quad 0x07f8000007f800000 # +inf
+ .quad 0x07f8000007f800000
+.L__real_nan: .quad 0x07fc000007fc00000 # NaN
+ .quad 0x07fc000007fc00000
+.L__real_ef: .quad 0x0402DF854402DF854 # float e
+ .quad 0x0402DF854402DF854
+
+.L__real_sign: .quad 0x08000000080000000 # sign bit
+ .quad 0x08000000080000000
+.L__real_notsign: .quad 0x07ffFFFFF7ffFFFFF # ^sign bit
+ .quad 0x07ffFFFFF7ffFFFFF
+.L__real_qnanbit: .quad 0x00040000000400000 # quiet nan bit
+ .quad 0x00040000000400000
+.L__real_mant:				.quad 0x0007FFFFF007FFFFF	# mantissa bits
+ .quad 0x0007FFFFF007FFFFF
+.L__real_3c000000: .quad 0x03c0000003c000000 # /* 0.0078125 = 1/128 */
+ .quad 0x03c0000003c000000
+.L__mask_127: .quad 0x00000007f0000007f #
+ .quad 0x00000007f0000007f
+.L__mask_040: .quad 0x00000004000000040 #
+ .quad 0x00000004000000040
+.L__mask_001: .quad 0x00000000100000001 #
+ .quad 0x00000000100000001
+
+
+.L__real_threshold: .quad 0x03CF5C28F3CF5C28F # .03
+ .quad 0x03CF5C28F3CF5C28F
+
+.L__real_ca1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333317923934e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_ca2: .quad 0x03C4CCCCD3C4CCCCD # 1.25000000037717509602e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_ca3: .quad 0x03B1249183B124918 # 2.23213998791944806202e-03
+ .quad 0x03B1249183B124918
+.L__real_ca4: .quad 0x039E401A639E401A6 # 4.34887777707614552256e-04
+ .quad 0x039E401A639E401A6
+.L__real_cb1: .quad 0x03DAAAAAB3DAAAAAB # 8.33333333333333593622e-02
+ .quad 0x03DAAAAAB3DAAAAAB
+.L__real_cb2: .quad 0x03C4CCCCD3C4CCCCD # 1.24999999978138668903e-02
+ .quad 0x03C4CCCCD3C4CCCCD
+.L__real_cb3: .quad 0x03B124A123B124A12 # 2.23219810758559851206e-03
+ .quad 0x03B124A123B124A12
+.L__real_log2_lead: .quad 0x03F3170003F317000 # 0.693115234375
+ .quad 0x03F3170003F317000
+.L__real_log2_tail: .quad 0x03805FDF43805FDF4 # 0.000031946183
+ .quad 0x03805FDF43805FDF4
+.L__real_half: .quad 0x03f0000003f000000 # 1/2
+ .quad 0x03f0000003f000000
+
+
+.L__np_ln__table:
+ .quad 0x0000000000000000 # 0.00000000000000000000e+00
+ .quad 0x3F8FC0A8B0FC03E4 # 1.55041813850402832031e-02
+ .quad 0x3F9F829B0E783300 # 3.07716131210327148438e-02
+ .quad 0x3FA77458F632DCFC # 4.58095073699951171875e-02
+ .quad 0x3FAF0A30C01162A6 # 6.06245994567871093750e-02
+ .quad 0x3FB341D7961BD1D1 # 7.52233862876892089844e-02
+ .quad 0x3FB6F0D28AE56B4C # 8.96121263504028320312e-02
+ .quad 0x3FBA926D3A4AD563 # 1.03796780109405517578e-01
+ .quad 0x3FBE27076E2AF2E6 # 1.17783010005950927734e-01
+ .quad 0x3FC0D77E7CD08E59 # 1.31576299667358398438e-01
+ .quad 0x3FC29552F81FF523 # 1.45181953907012939453e-01
+ .quad 0x3FC44D2B6CCB7D1E # 1.58604979515075683594e-01
+ .quad 0x3FC5FF3070A793D4 # 1.71850204467773437500e-01
+ .quad 0x3FC7AB890210D909 # 1.84922337532043457031e-01
+ .quad 0x3FC9525A9CF456B4 # 1.97825729846954345703e-01
+ .quad 0x3FCAF3C94E80BFF3 # 2.10564732551574707031e-01
+ .quad 0x3FCC8FF7C79A9A22 # 2.23143517971038818359e-01
+ .quad 0x3FCE27076E2AF2E6 # 2.35566020011901855469e-01
+ .quad 0x3FCFB9186D5E3E2B # 2.47836112976074218750e-01
+ .quad 0x3FD0A324E27390E3 # 2.59957492351531982422e-01
+ .quad 0x3FD1675CABABA60E # 2.71933674812316894531e-01
+ .quad 0x3FD22941FBCF7966 # 2.83768117427825927734e-01
+ .quad 0x3FD2E8E2BAE11D31 # 2.95464158058166503906e-01
+ .quad 0x3FD3A64C556945EA # 3.07025015354156494141e-01
+ .quad 0x3FD4618BC21C5EC2 # 3.18453729152679443359e-01
+ .quad 0x3FD51AAD872DF82D # 3.29753279685974121094e-01
+ .quad 0x3FD5D1BDBF5809CA # 3.40926527976989746094e-01
+ .quad 0x3FD686C81E9B14AF # 3.51976394653320312500e-01
+ .quad 0x3FD739D7F6BBD007 # 3.62905442714691162109e-01
+ .quad 0x3FD7EAF83B82AFC3 # 3.73716354370117187500e-01
+ .quad 0x3FD89A3386C1425B # 3.84411692619323730469e-01
+ .quad 0x3FD947941C2116FB # 3.94993782043457031250e-01
+ .quad 0x3FD9F323ECBF984C # 4.05465066432952880859e-01
+ .quad 0x3FDA9CEC9A9A084A # 4.15827870368957519531e-01
+ .quad 0x3FDB44F77BCC8F63 # 4.26084339618682861328e-01
+ .quad 0x3FDBEB4D9DA71B7C # 4.36236739158630371094e-01
+ .quad 0x3FDC8FF7C79A9A22 # 4.46287095546722412109e-01
+ .quad 0x3FDD32FE7E00EBD5 # 4.56237375736236572266e-01
+ .quad 0x3FDDD46A04C1C4A1 # 4.66089725494384765625e-01
+ .quad 0x3FDE744261D68788 # 4.75845873355865478516e-01
+ .quad 0x3FDF128F5FAF06ED # 4.85507786273956298828e-01
+ .quad 0x3FDFAF588F78F31F # 4.95077252388000488281e-01
+ .quad 0x3FE02552A5A5D0FF # 5.04556000232696533203e-01
+ .quad 0x3FE0723E5C1CDF40 # 5.13945698738098144531e-01
+ .quad 0x3FE0BE72E4252A83 # 5.23248136043548583984e-01
+ .quad 0x3FE109F39E2D4C97 # 5.32464742660522460938e-01
+ .quad 0x3FE154C3D2F4D5EA # 5.41597247123718261719e-01
+ .quad 0x3FE19EE6B467C96F # 5.50647079944610595703e-01
+ .quad 0x3FE1E85F5E7040D0 # 5.59615731239318847656e-01
+ .quad 0x3FE23130D7BEBF43 # 5.68504691123962402344e-01
+ .quad 0x3FE2795E1289B11B # 5.77315330505371093750e-01
+ .quad 0x3FE2C0E9ED448E8C # 5.86049020290374755859e-01
+ .quad 0x3FE307D7334F10BE # 5.94707071781158447266e-01
+ .quad 0x3FE34E289D9CE1D3 # 6.03290796279907226562e-01
+ .quad 0x3FE393E0D3562A1A # 6.11801505088806152344e-01
+ .quad 0x3FE3D9026A7156FB # 6.20240390300750732422e-01
+ .quad 0x3FE41D8FE84672AE # 6.28608644008636474609e-01
+ .quad 0x3FE4618BC21C5EC2 # 6.36907458305358886719e-01
+ .quad 0x3FE4A4F85DB03EBB # 6.45137906074523925781e-01
+ .quad 0x3FE4E7D811B75BB1 # 6.53301239013671875000e-01
+ .quad 0x3FE52A2D265BC5AB # 6.61398470401763916016e-01
+ .quad 0x3FE56BF9D5B3F399 # 6.69430613517761230469e-01
+ .quad 0x3FE5AD404C359F2D # 6.77398800849914550781e-01
+ .quad 0x3FE5EE02A9241675 # 6.85303986072540283203e-01
+ .quad 0x3FE62E42FEFA39EF # 6.93147122859954833984e-01
+ .quad 0 # for alignment
+
+.L__np_ln_lead_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x3C7E0000 # 0.015502929688 1
+ .long 0x3CFC1000 # 0.030769348145 2
+ .long 0x3D3BA000 # 0.045806884766 3
+ .long 0x3D785000 # 0.060623168945 4
+ .long 0x3D9A0000 # 0.075195312500 5
+ .long 0x3DB78000 # 0.089599609375 6
+ .long 0x3DD49000 # 0.103790283203 7
+ .long 0x3DF13000 # 0.117767333984 8
+ .long 0x3E06B000 # 0.131530761719 9
+ .long 0x3E14A000 # 0.145141601563 10
+ .long 0x3E226000 # 0.158569335938 11
+ .long 0x3E2FF000 # 0.171813964844 12
+ .long 0x3E3D5000 # 0.184875488281 13
+ .long 0x3E4A9000 # 0.197814941406 14
+ .long 0x3E579000 # 0.210510253906 15
+ .long 0x3E647000 # 0.223083496094 16
+ .long 0x3E713000 # 0.235534667969 17
+ .long 0x3E7DC000 # 0.247802734375 18
+ .long 0x3E851000 # 0.259887695313 19
+ .long 0x3E8B3000 # 0.271850585938 20
+ .long 0x3E914000 # 0.283691406250 21
+ .long 0x3E974000 # 0.295410156250 22
+ .long 0x3E9D3000 # 0.307006835938 23
+ .long 0x3EA30000 # 0.318359375000 24
+ .long 0x3EA8D000 # 0.329711914063 25
+ .long 0x3EAE8000 # 0.340820312500 26
+ .long 0x3EB43000 # 0.351928710938 27
+ .long 0x3EB9C000 # 0.362792968750 28
+ .long 0x3EBF5000 # 0.373657226563 29
+ .long 0x3EC4D000 # 0.384399414063 30
+ .long 0x3ECA3000 # 0.394897460938 31
+ .long 0x3ECF9000 # 0.405395507813 32
+ .long 0x3ED4E000 # 0.415771484375 33
+ .long 0x3EDA2000 # 0.426025390625 34
+ .long 0x3EDF5000 # 0.436157226563 35
+ .long 0x3EE47000 # 0.446166992188 36
+ .long 0x3EE99000 # 0.456176757813 37
+ .long 0x3EEEA000 # 0.466064453125 38
+ .long 0x3EF3A000 # 0.475830078125 39
+ .long 0x3EF89000 # 0.485473632813 40
+ .long 0x3EFD7000 # 0.494995117188 41
+ .long 0x3F012000 # 0.504394531250 42
+ .long 0x3F039000 # 0.513916015625 43
+ .long 0x3F05F000 # 0.523193359375 44
+ .long 0x3F084000 # 0.532226562500 45
+ .long 0x3F0AA000 # 0.541503906250 46
+ .long 0x3F0CF000 # 0.550537109375 47
+ .long 0x3F0F4000 # 0.559570312500 48
+ .long 0x3F118000 # 0.568359375000 49
+ .long 0x3F13C000 # 0.577148437500 50
+ .long 0x3F160000 # 0.585937500000 51
+ .long 0x3F183000 # 0.594482421875 52
+ .long 0x3F1A7000 # 0.603271484375 53
+ .long 0x3F1C9000 # 0.611572265625 54
+ .long 0x3F1EC000 # 0.620117187500 55
+ .long 0x3F20E000 # 0.628417968750 56
+ .long 0x3F230000 # 0.636718750000 57
+ .long 0x3F252000 # 0.645019531250 58
+ .long 0x3F273000 # 0.653076171875 59
+ .long 0x3F295000 # 0.661376953125 60
+ .long 0x3F2B5000 # 0.669189453125 61
+ .long 0x3F2D6000 # 0.677246093750 62
+ .long 0x3F2F7000 # 0.685302734375 63
+ .long 0x3F317000 # 0.693115234375 64
+ .long 0 # for alignment
+
+.L__np_ln_tail_table:
+ .long 0x00000000 # 0.000000000000 0
+ .long 0x35A8B0FC # 0.000001256848 1
+ .long 0x361B0E78 # 0.000002310522 2
+ .long 0x3631EC66 # 0.000002651266 3
+ .long 0x35C30046 # 0.000001452871 4
+ .long 0x37EBCB0E # 0.000028108738 5
+ .long 0x37528AE5 # 0.000012549314 6
+ .long 0x36DA7496 # 0.000006510479 7
+ .long 0x3783B715 # 0.000015701671 8
+ .long 0x383F3E68 # 0.000045596069 9
+ .long 0x38297C10 # 0.000040408282 10
+ .long 0x3815B666 # 0.000035694240 11
+ .long 0x38183854 # 0.000036292084 12
+ .long 0x38448108 # 0.000046850211 13
+ .long 0x373539E9 # 0.000010801924 14
+ .long 0x3864A740 # 0.000054515200 15
+ .long 0x387BE3CD # 0.000060055219 16
+ .long 0x3803B715 # 0.000031403342 17
+ .long 0x380C36AF # 0.000033429529 18
+ .long 0x3892713A # 0.000069829126 19
+ .long 0x38AE55D6 # 0.000083129547 20
+ .long 0x38A0FDE8 # 0.000076766883 21
+ .long 0x3862BAE1 # 0.000054056643 22
+ .long 0x3798AAD3 # 0.000018199358 23
+ .long 0x38C5E10E # 0.000094356117 24
+ .long 0x382D872E # 0.000041372310 25
+ .long 0x38DEDFAC # 0.000106274470 26
+ .long 0x38481E9B # 0.000047712219 27
+ .long 0x38EBFB5E # 0.000112524940 28
+ .long 0x38783B83 # 0.000059183232 29
+ .long 0x374E1B05 # 0.000012284848 30
+ .long 0x38CA0E11 # 0.000096347307 31
+ .long 0x3891F660 # 0.000069600297 32
+ .long 0x386C9A9A # 0.000056410769 33
+ .long 0x38777BCD # 0.000059004688 34
+ .long 0x38A6CED4 # 0.000079540216 35
+ .long 0x38FBE3CD # 0.000120110439 36
+ .long 0x387E7E01 # 0.000060675669 37
+ .long 0x37D40984 # 0.000025276800 38
+ .long 0x3784C3AD # 0.000015826745 39
+ .long 0x380F5FAF # 0.000034182969 40
+ .long 0x38AC47BC # 0.000082149607 41
+ .long 0x392952D3 # 0.000161479504 42
+ .long 0x37F97073 # 0.000029735476 43
+ .long 0x3865C84A # 0.000054784388 44
+ .long 0x3979CF17 # 0.000238236375 45
+ .long 0x38C3D2F5 # 0.000093376184 46
+ .long 0x38E6B468 # 0.000110008579 47
+ .long 0x383EBCE1 # 0.000045475437 48
+ .long 0x39186BDF # 0.000145360347 49
+ .long 0x392F0945 # 0.000166927537 50
+ .long 0x38E9ED45 # 0.000111545007 51
+ .long 0x396B99A8 # 0.000224685878 52
+ .long 0x37A27674 # 0.000019367064 53
+ .long 0x397069AB # 0.000229275480 54
+ .long 0x39013539 # 0.000123222257 55
+ .long 0x3947F423 # 0.000190690669 56
+ .long 0x3945E10E # 0.000188712234 57
+ .long 0x38F85DB0 # 0.000118430122 58
+ .long 0x396C08DC # 0.000225100142 59
+ .long 0x37B4996F # 0.000021529120 60
+ .long 0x397CEADA # 0.000241200818 61
+ .long 0x3920261B # 0.000152729845 62
+ .long 0x35AA4906 # 0.000001268724 63
+ .long 0x3805FDF4 # 0.000031946183 64
+ .long 0 # for alignment
+
diff --git a/src/gas/vrsapowf.S b/src/gas/vrsapowf.S
new file mode 100644
index 0000000..3521a6b
--- /dev/null
+++ b/src/gas/vrsapowf.S
@@ -0,0 +1,782 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsapowf.s
+#
+# An array implementation of the powf libm function.
+#
+# Prototype:
+#
+# void vrsa_powf(int n, float *x, float *y, float *z);
+#
+# Computes x raised to the y power.
+#
+# Places the results into the supplied z array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
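+#
+# A minimal C usage sketch (illustrative only; the array contents are hypothetical):
+#
+#   float x[8], y[8], z[8];
+#   /* ... fill x and y ... */
+#   vrsa_powf(8, x, y, z);    /* z[i] = powf(x[i], y[i]) for i = 0..7 */
+#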
+
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+# define local variable storage offsets
+.equ p_temp,0x00 # xmmword
+.equ p_negateres,0x10 # qword
+
+
+.equ save_rbx,0x030 #qword
+
+
+.equ p_ax,0x050 # absolute x
+.equ p_sx,0x060 # sign of x's
+
+.equ p_ay,0x070 # absolute y
+.equ p_yexp,0x080 # unbiased exponent of y
+
+.equ p_inty,0x090 # integer y indicators
+
+.equ p_xptr,0x0a0 # ptr to x values
+.equ p_yptr,0x0a8 # ptr to y values
+.equ p_zptr,0x0b0 # ptr to z values
+
+.equ p_nv,0x0b8 #qword
+.equ p_iter,0x0c0 # qword storage for number of loop iterations
+
+.equ p2_temp,0x0d0 #qword
+.equ p2_temp1,0x0f0 #qword
+
+.equ stack_size,0x0118 # allocate 40h more than
+ # we need to avoid bank conflicts
+
+
+
+
+ .weak vrsa_powf_
+ .set vrsa_powf_,__vrsa_powf__
+ .weak vrsa_powf__
+ .set vrsa_powf__,__vrsa_powf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#/* a FORTRAN subroutine implementation of array powf
+#** VRSA_POWF(N,X,Y,Z)
+#** C equivalent
+#*/
+#void vrsa_powf_(int * n, float *x, float *y, float *z)
+#{
+# vrsa_powf(*n,x,y,z);
+#}
+
+.globl __vrsa_powf__
+ .type __vrsa_powf__,@function
+__vrsa_powf__:
+ mov (%rdi),%edi
+
+
+# parameters are passed in by Linux as:
+# edi - int n
+# rsi - float *x
+# rdx - float *y
+# rcx - float *z
+
+.globl vrsa_powf
+ .type vrsa_powf,@function
+vrsa_powf:
+
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+# save the arguments
+ mov %rsi,p_xptr(%rsp) # save pointer to x
+ mov %rdx,p_yptr(%rsp) # save pointer to y
+ mov %rcx,p_zptr(%rsp) # save pointer to z
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+#endif
+
+ mov %rax,%rcx
+ mov %rcx,p_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rcx # compute number of extra single calls
+ mov %rcx,p_nv(%rsp) # save number of left over values
+
+# process the array 4 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+# first get x
+ mov p_xptr(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ prefetch 64(%rsi)
+
+ movaps %xmm0,%xmm2
+ andps .L__mask_nsign(%rip),%xmm0 # get abs x
+ andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits
+ movaps %xmm0,p_ax(%rsp) # save them
+ movaps %xmm2,p_sx(%rsp) # save them
+# convert all four x's to double
+ cvtps2pd p_ax(%rsp),%xmm0
+ cvtps2pd p_ax+8(%rsp),%xmm1
+#
+# classify y
+# vector 32 bit integer method 25 cycles to here
+# /* See whether y is an integer.
+# inty = 0 means not an integer.
+# inty = 1 means odd integer.
+# inty = 2 means even integer.
+# */
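+# A scalar C sketch of the same classification (it mirrors the scalar code in
+# vrsapowxf.S below; uy is the IEEE-754 bit pattern of y, names are illustrative):
+#
+#   int yexp = (int)((uy & 0x7f800000u) >> 23) - 126;   /* unbiased exponent + 1 */
+#   if (yexp < 1)        inty = 0;                      /* |y| < 1.0, not integer */
+#   else if (yexp > 24)  inty = 2;                      /* no fractional bits left */
+#   else {
+#       unsigned mask = (1u << (24 - yexp)) - 1;
+#       if (uy & mask)                              inty = 0;
+#       else if (((uy & ~mask) >> (24 - yexp)) & 1) inty = 1;
+#       else                                        inty = 2;
+#   }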
+ mov p_yptr(%rsp),%rdi # get y_array pointer
+ movups (%rdi),%xmm4
+ prefetch 64(%rdi)
+ pxor %xmm3,%xmm3
+ pand .L__mask_nsign(%rip),%xmm4 # get abs y in integer format
+ movdqa %xmm4,p_ay(%rsp) # save it
+
+# see if the number is less than 1.0
+ psrld $23,%xmm4 #>> EXPSHIFTBITS_SP32
+
+ psubd .L__mask_127(%rip),%xmm4 # yexp, unbiased exponent
+ movdqa %xmm4,p_yexp(%rsp) # save it
+ paddd .L__mask_1(%rip),%xmm4 # yexp+1
+ pcmpgtd %xmm3,%xmm4 # 0 if exp less than 126 (2^0) (y < 1.0), else FFs
+# xmm4 is ffs if abs(y) >=1.0, else 0
+
+# see if the mantissa has fractional bits
+#build mask for mantissa
+ movdqa .L__mask_23(%rip),%xmm2
+ psubd p_yexp(%rsp),%xmm2 # 24-yexp
+ pmaxsw %xmm3,%xmm2 # no shift counts less than 0
+ movdqa %xmm2,p_temp(%rsp) # save the shift counts
+# create mask for all four values
+# SSE can't do individual shifts so we have to do each one separately
+ mov p_temp(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rax,%rbx
+ mov %rbx,p_temp(%rsp)
+ mov p_temp+8(%rsp),%rcx
+ mov $1,%rbx
+ shl %cl,%ebx #1 << (24 - yexp)
+ shr $32,%rcx
+ mov $1,%eax
+ shl %cl,%eax #1 << (24 - yexp)
+ shl $32,%rax
+ add %rbx,%rax
+ mov %rax,p_temp+8(%rsp)
+ movdqa p_temp(%rsp),%xmm5
+ psubd .L__mask_1(%rip),%xmm5 #= mask = (1 << (24 - yexp)) - 1
+
+# now use the mask to see if there are any fractional bits
+ movdqu (%rdi),%xmm2 # get uy
+ pand %xmm5,%xmm2 # uy & mask
+ pcmpeqd %xmm3,%xmm2 # 0 if not zero (y has fractional mantissa bits), else FFs
+ pand %xmm4,%xmm2 # either 0s or ff
+# xmm2 now accounts for y< 1.0 or y>=1.0 and y has fractional mantissa bits,
+# it has the value 0 if we know it's non-integer or ff if integer.
+
+# now see if it's even or odd.
+
+## if yexp > 24, then it has to be even
+ movdqa .L__mask_24(%rip),%xmm4
+ psubd p_yexp(%rsp),%xmm4 # 24-yexp
+ paddd .L__mask_1(%rip),%xmm5 # mask+1 = least significant integer bit
+ pcmpgtd %xmm3,%xmm4 # if 0, then must be even, else ff's
+
+ pand %xmm4,%xmm5 # set the integer bit mask to zero if yexp>24
+ paddd .L__mask_2(%rip),%xmm4
+ por .L__mask_2(%rip),%xmm4
+ pand %xmm2,%xmm4 # result can be 0, 2, or 3
+
+# now for integer numbers, see if odd or even
+ pand .L__mask_mant(%rip),%xmm5 # mask out exponent bits
+ movdqu (%rdi),%xmm2
+ pand %xmm2,%xmm5 # & uy -> even or odd
+ movdqa .L__float_one(%rip),%xmm2
+ pcmpeqd p_ay(%rsp),%xmm2 # is ay equal to 1, ff's if so, then it's odd
+ pand .L__mask_nsign(%rip),%xmm2 # strip the sign bit so the gt comparison works.
+ por %xmm2,%xmm5
+ pcmpgtd %xmm3,%xmm5 # if odd then ff's, else 0's for even
+ paddd .L__mask_2(%rip),%xmm5 # gives us 2 for even, 1 for odd
+ pand %xmm5,%xmm4
+
+ movdqa %xmm4,p_inty(%rsp) # save inty
+#
+# do more x special case checking
+#
+ movdqa %xmm4,%xmm5
+ pcmpeqd %xmm3,%xmm5 # is not an integer? ff's if so
+ pand .L__mask_NaN(%rip),%xmm5 # these values will be NaNs, if x<0
+ movdqa %xmm4,%xmm2
+ pcmpeqd .L__mask_1(%rip),%xmm2 # is it odd? ff's if so
+ pand .L__mask_sign(%rip),%xmm2 # these values will get their sign bit set
+ por %xmm2,%xmm5
+
+ pcmpeqd p_sx(%rsp),%xmm3 # if the signs are set
+ pandn %xmm5,%xmm3 # then negateres gets the values as shown below
+ movdqa %xmm3,p_negateres(%rsp) # save negateres
+
+# /* p_negateres now means the following.
+# 7FC00000 means x<0, y not an integer, return NaN.
+# 80000000 means x<0, y is odd integer, so set the sign bit.
+## 0 means even integer, and/or x>=0.
+# */
+
+
+# **** Here starts the main calculations ****
+# The algorithm used is x**y = exp(y*log(x))
+# Extra precision is required in intermediate steps to meet the 1ulp requirement
+#
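+# Roughly, in scalar C terms (a sketch of the idea, not the exact code path):
+#   double w = (double)y * log((double)fabsf(x));   /* log and multiply in double */
+#   float  r = (float)exp(w);                        /* then apply p_negateres     */
+#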
+# log(x) calculation
+ call __vrd4_log@PLT # get the double precision log value
+ # for all four x's
+# y* logx
+# convert all four y's to double
+# mov p_yptr(%rsp),%rdi ; get y_array pointer
+ cvtps2pd (%rdi),%xmm2
+ cvtps2pd 8(%rdi),%xmm3
+
+# /* just multiply by y */
+ mulpd %xmm2,%xmm0
+ mulpd %xmm3,%xmm1
+
+# /* The following code computes r = exp(w) */
+ call __vrd4_exp@PLT # get the double exp value
+ # for all four y*log(x)'s
+ mov p_xptr(%rsp),%rsi # get x_array pointer
+ mov p_yptr(%rsp),%rdi # get y_array pointer
+#
+# convert all four results back to single
+ cvtpd2ps %xmm0,%xmm0
+ cvtpd2ps %xmm1,%xmm1
+ movlhps %xmm1,%xmm0
+
+# perform special case and error checking on input values
+
+# special case checking is done first in the scalar version since
+# it allows for early fast returns. But for vectors, we consider them
+# to be rare, so early returns are not necessary. So we first compute
+# the x**y values, and then check for special cases.
+
+# we do some of the checking in reverse order of the scalar version.
+# apply the negate result flags
+ orps p_negateres(%rsp),%xmm0 # get negateres
+
+## if y is infinite or so large that the result would overflow or underflow
+ movdqa p_ay(%rsp),%xmm4
+ cmpps $5,.L__mask_ly(%rip),%xmm4 # y not less than large value, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Ly_large
+.Lrnsx3:
+
+## if x is infinite
+ movdqa p_ax(%rsp),%xmm4
+ cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_infinite
+.Lrnsx1:
+## if x is zero
+ xorps %xmm4,%xmm4
+ cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_zero
+.Lrnsx2:
+## if y is NAN
+ movdqu (%rdi),%xmm4 # get y
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of y to itself should
+ # be false, unless y is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Ly_NaN
+.Lrnsx4:
+## if x is NAN
+ movdqu (%rsi),%xmm4 # get x
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should
+ # be false, unless x is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Lx_NaN
+.Lrnsx5:
+
+## if |y| == 0 then return 1
+ movdqa .L__float_one(%rip),%xmm3 # one
+ xorps %xmm2,%xmm2
+ cmpps $4,p_ay(%rsp),%xmm2 # not equal to 0.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+## if x == +1, return +1 for all x
+ movdqa %xmm3,%xmm2
+ movdqu (%rsi),%xmm5
+ cmpps $4,%xmm5,%xmm2 # not equal to +1.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+
+.L__powf_cleanup2:
+
+# update the x and y pointers
+ add $16,%rdi
+ add $16,%rsi
+ mov %rsi,p_xptr(%rsp) # save x_array pointer
+ mov %rdi,p_yptr(%rsp) # save y_array pointer
+# store the result _m128d
+ mov p_zptr(%rsp),%rdi # get z_array pointer
+ movups %xmm0,(%rdi)
+# prefetchw QWORD PTR [rdi+64]
+ prefetch 64(%rdi)
+ add $16,%rdi
+ mov %rdi,p_zptr(%rsp) # save z_array pointer
+
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov p_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+.L__final_check:
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+# we jump here when there are one to three leftover values to process at the
+# end
+.L__vsa_cleanup:
+ mov p_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov p_xptr(%rsp),%rsi
+ mov p_yptr(%rsp),%rdi
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ mov (%rdi),%edx # we know there's at least one
+ mov %edx,p2_temp+16(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ mov 4(%rdi),%edx # we know there's at least one
+ mov %edx,p2_temp+20(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ mov 8(%rdi),%edx # we know there's at least one
+ mov %edx,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $4,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ lea p2_temp+16(%rsp),%rdx # &y parameter
+ lea p2_temp1(%rsp),%rcx # &z parameter
+ call vrsa_powf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov p_zptr(%rsp),%rdi
+ mov p_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+ .align 16
+# y is a NaN.
+.Ly_NaN:
+ mov p_yptr(%rsp),%rdx # get pointer to y
+ movdqu (%rdx),%xmm4 # get y
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of y to itself should
+ # be true, unless y is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ movdqu (%rcx),%xmm4 # get x
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should
+ # be true, unless x is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx5
+
+# y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lylrga
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lylrga:
+ test $2,%edx
+ jz .Lylrgb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lylrgb:
+ test $4,%edx
+ jz .Lylrgc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lylrgc:
+ test $8,%edx
+ jz .Lylrgd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lylrgd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in eax, y in ebx.
+# returns result in eax
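+#
+# In C terms this is roughly (ax = |x| bits, uy = y bits; a sketch only):
+#   if (ax == 0x3f800000) return 0x3f800000;               /* |x| == 1 -> 1.0f       */
+#   if ((uy & 0x80000000) == 0)                            /* y = +inf or huge +y    */
+#       return (ax > 0x3f800000) ? 0x7f800000 : 0;         /* inf if |x| > 1, else 0 */
+#   else                                                   /* y = -inf or huge -y    */
+#       return (ax < 0x3f800000) ? 0x7f800000 : 0;         /* inf if |x| < 1, else 0 */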
+.Lnp_special6:
+# handle |x|==1 cases first
+ mov $0x07FFFFFFF,%r8d
+ and %eax,%r8d
+ cmp $0x03f800000,%r8d # jump if |x| !=1
+ jnz .Lnps6
+ mov $0x03f800000,%eax # return 1 for all |x|==1
+ jmp .Lnpx64
+
+# cases where |x| !=1
+.Lnps6:
+ mov $0x07f800000,%ecx
+ xor %eax,%eax # assume 0 return
+ test $0x080000000,%ebx
+ jnz .Lnps62 # jump if y negative
+# y = +inf
+ cmp $0x03f800000,%r8d
+ cmovg %ecx,%eax # return inf if |x| > 1
+ jmp .Lnpx64
+.Lnps62:
+# y = -inf
+ cmp $0x03f800000,%r8d
+ cmovl %ecx,%eax # return inf if |x| < 1
+ jmp .Lnpx64
+
+.Lnpx64:
+ ret
+
+# handle cases where x is +/- infinity. edx is the mask
+ .align 16
+.Lx_infinite:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxinfa
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxinfa:
+ test $2,%edx
+ jz .Lxinfb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxinfb:
+ test $4,%edx
+ jz .Lxinfc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxinfc:
+ test $8,%edx
+ jz .Lxinfd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxinfd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in eax, y in ebx, inty in ecx.
+# returns result in eax
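+#
+# In C terms this is roughly (x, uy are the float bit patterns; a sketch only):
+#   if (!(x & 0x80000000))                           /* x = +inf                       */
+#       return (uy & 0x80000000) ? 0 : x;            /* +0 if y < 0, +inf if y > 0     */
+#   if (inty == 1)                                   /* x = -inf, y an odd integer     */
+#       return (uy & 0x80000000) ? 0x80000000 : x;   /* -0 if y < 0, -inf if y > 0     */
+#   return (uy & 0x80000000) ? 0 : (x & 0x7fffffff); /* +0 if y < 0, +inf if y > 0     */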
+.Lnp_special_x1: # x is infinite
+ test $0x080000000,%eax # is x positive
+ jnz .Lnsx11 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ xor %eax,%eax # else return 0
+ jmp .Lnsx13
+
+.Lnsx11:
+ cmp $1,%ecx # if inty ==1
+ jnz .Lnsx12 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ mov $0x080000000,%eax # else return -0
+ jmp .Lnsx13
+.Lnsx12: # inty <>1
+ and $0x07FFFFFFF,%eax # eax = |x| (+inf), returned if y >= 0
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return |x| if so
+ xor %eax,%eax # return 0 if y < 0
+.Lnsx13:
+ ret
+
+
+# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0
+ .align 16
+.Lx_zero:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxzera
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov (%rcx),%eax
+ mov (%rbx),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxzera:
+ test $2,%edx
+ jz .Lxzerb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 4(%rcx),%eax
+ mov 4(%rbx),%ebx
+ mov p_inty+4(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxzerb:
+ test $4,%edx
+ jz .Lxzerc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 8(%rcx),%eax
+ mov 8(%rbx),%ebx
+ mov p_inty+8(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxzerc:
+ test $8,%edx
+ jz .Lxzerd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_yptr(%rsp),%rbx # get pointer to y
+ mov 12(%rcx),%eax
+ mov 12(%rbx),%ebx
+ mov p_inty+12(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxzerd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in eax, y in ebx, inty in ecx.
+# returns result in eax
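+#
+# In C terms this is roughly (x, uy are the float bit patterns; a sketch only):
+#   unsigned inf_or_zero = (uy & 0x80000000) ? 0x7f800000 : 0;   /* inf if y < 0, else 0 */
+#   if (inty == 1) return (x & 0x80000000) | inf_or_zero;        /* keep the sign of x   */
+#   else           return inf_or_zero;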
+ .align 16
+.Lnp_special_x2:
+ cmp $1,%ecx # if inty ==1
+ jz .Lnsx21 # jump if so
+# handle cases of x=+/-0, y not integer
+ xor %eax,%eax
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%eax
+ jmp .Lnsx23
+# y is an integer
+.Lnsx21:
+ xor %r8d,%r8d
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%r8d # set to infinity if not
+ and $0x080000000,%eax # pickup the sign of x
+ or %r8d,%eax # and include it in the result
+.Lnsx23:
+ ret
+
+
+
+ .data
+ .align 64
+
+.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask
+ .quad 0x08000000080000000
+
+.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask
+ .quad 0x07FFFFFFF7FFFFFFF
+
+# used by inty
+.L__mask_127: .quad 0x00000007F0000007F # EXPBIAS_SP32
+ .quad 0x00000007F0000007F
+
+.L__mask_mant: .quad 0x0007FFFFF007FFFFF # mantissa bit mask
+ .quad 0x0007FFFFF007FFFFF
+
+.L__mask_1: .quad 0x00000000100000001 # 1
+ .quad 0x00000000100000001
+
+.L__mask_2: .quad 0x00000000200000002 # 2
+ .quad 0x00000000200000002
+
+.L__mask_24: .quad 0x00000001800000018 # 24
+ .quad 0x00000001800000018
+
+.L__mask_23: .quad 0x00000001700000017 # 23
+ .quad 0x00000001700000017
+
+# used by special case checking
+
+.L__float_one: .quad 0x03f8000003f800000 # one
+ .quad 0x03f8000003f800000
+
+.L__mask_inf: .quad 0x07f8000007F800000 # infinity
+ .quad 0x07f8000007F800000
+
+.L__mask_NaN: .quad 0x07fC000007FC00000 # NaN
+ .quad 0x07fC000007FC00000
+
+.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit
+ .quad 0x00040000000400000
+
+.L__mask_ly: .quad 0x04f0000004f000000 # large y
+ .quad 0x04f0000004f000000
+
+
diff --git a/src/gas/vrsapowxf.S b/src/gas/vrsapowxf.S
new file mode 100644
index 0000000..4f67daf
--- /dev/null
+++ b/src/gas/vrsapowxf.S
@@ -0,0 +1,753 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsapowxf.s
+#
+# An array implementation of the powf libm function.
+# This routine raises the x array to a constant y power.
+#
+# Prototype:
+#
+# void vrsa_powxf(int n, float *x, float y, float *z);
+#
+# Places the results into the supplied z array.
+# Does not perform error handling, but does return C99 values for error
+# inputs. Denormal results are truncated to 0.
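+#
+# A minimal C usage sketch (illustrative only; the values are hypothetical):
+#
+#   float x[8], z[8];
+#   /* ... fill x ... */
+#   vrsa_powxf(8, x, 2.5f, z);    /* z[i] = powf(x[i], 2.5f) for i = 0..7 */
+#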
+#
+#
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+
+# define local variable storage offsets
+.equ p_temp,0x00 # xmmword
+.equ p_negateres,0x10 # qword
+
+.equ p_xexp,0x20 # qword
+
+.equ save_rbx,0x030 #qword
+
+.equ p_y,0x048 # y value
+
+.equ p_ax,0x050 # absolute x
+.equ p_sx,0x060 # sign of x's
+
+.equ p_ay,0x070 # absolute y
+.equ p_yexp,0x080 # unbiased exponent of y
+
+.equ p_inty,0x090 # integer y indicator
+
+.equ p_xptr,0x0a0 # ptr to x values
+.equ p_zptr,0x0b0 # ptr to z values
+
+.equ p_nv,0x0b8 #qword
+.equ p_iter,0x0c0 # qword storage for number of loop iterations
+
+.equ p2_temp,0x0d0 #qword
+.equ p2_temp1,0x0f0 #qword
+
+.equ stack_size,0x0118 # allocate 40h more than
+ # we need to avoid bank conflicts
+
+
+
+
+ .weak vrsa_powxf_
+ .set vrsa_powxf_,__vrsa_powxf__
+ .weak vrsa_powxf__
+ .set vrsa_powxf__,__vrsa_powxf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+.globl __vrsa_powxf__
+ .type __vrsa_powxf__,@function
+__vrsa_powxf__:
+
+#/* a FORTRAN subroutine implementation of array powf
+#** VRSA_POWXF(N,X,Y,Z)
+#** C equivalent
+#*/
+#void vrsa_powxf_(int * n, float *x, float *y, float *z)
+#{
+# vrsa_powxf(*n,x,y,z);
+#}
+# parameters are passed in by Linux FORTRAN as:
+# edi - int n
+# rsi - float *x
+# rdx - float *y
+# rcx - float *z
+ mov (%rdi),%edi
+ movss (%rdx),%xmm0
+ mov %rcx,%rdx
+
+
+
+
+# parameters are passed in by Linux C as:
+# edi - int n
+# rsi - float *x
+# xmm0 - float y
+# rdx - float *z
+
+.globl vrsa_powxf
+ .type vrsa_powxf,@function
+vrsa_powxf:
+
+ sub $stack_size,%rsp
+ mov %rbx,save_rbx(%rsp) # save rbx
+
+ movss %xmm0,p_y(%rsp) # save y
+ mov %rsi,p_xptr(%rsp) # save pointer to x
+ mov %rdx,p_zptr(%rsp) # save pointer to z
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+#endif
+ test %rax,%rax # just return if count is zero
+ jz .L__final_check # exit if count is zero
+
+ mov %rax,%rcx
+ mov %rcx,p_nv(%rsp) # save number of values
+
+#
+# classify y
+# vector 32 bit integer method
+# /* See whether y is an integer.
+# inty = 0 means not an integer.
+# inty = 1 means odd integer.
+# inty = 2 means even integer.
+# */
+# movdqa .LXMMWORD(%rip),%xmm4 PTR [rdx]
+# get yexp
+ mov p_y(%rsp),%r8d # r8 is uy
+ mov $0x07fffffff,%r9d
+ and %r8d,%r9d # r9 is ay
+
+## if |y| == 0 then return 1
+ cmp $0,%r9d # is y a zero?
+ jz .Ly_zero
+
+ mov $0x07f800000,%eax # EXPBITS_SP32
+ and %r9d,%eax # y exp
+
+ xor %edi,%edi
+ shr $23,%eax #>> EXPSHIFTBITS_SP32
+ sub $126,%eax # - EXPBIAS_SP32 + 1 - eax is now the unbiased exponent
+ mov $1,%ebx
+ cmp %ebx,%eax # if (yexp < 1)
+ cmovl %edi,%ebx
+ jl .Lsave_inty
+
+ mov $24,%ecx
+ cmp %ecx,%eax # if (yexp >24)
+ jle .Lcly1
+ mov $2,%ebx
+ jmp .Lsave_inty
+.Lcly1: # else 1<=yexp<=24
+ sub %eax,%ecx # build mask for mantissa
+ shl %cl,%ebx
+ dec %ebx # rbx = mask = (1 << (24 - yexp)) - 1
+
+ mov %r8d,%eax
+ and %ebx,%eax # if ((uy & mask) != 0)
+ cmovnz %edi,%ebx # inty = 0;
+ jnz .Lsave_inty
+
+ not %ebx # else if (((uy & ~mask) >> (24 - yexp)) & 0x00000001)
+ mov %r8d,%eax
+ and %ebx,%eax
+ shr %cl,%eax
+ inc %edi
+ and %edi,%eax
+ mov %edi,%ebx # inty = 1
+ jnz .Lsave_inty
+ inc %ebx # else inty = 2
+
+
+.Lsave_inty:
+ mov %r8d,p_y+4(%rsp) # save an extra copy of y
+ mov %ebx,p_inty(%rsp) # save inty
+
+ mov p_nv(%rsp),%rax # get number of values
+ mov %rax,%rcx
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rcx # compute number of extra single calls
+ mov %rcx,p_nv(%rsp) # save number of left over values
+
+# process the array 4 values at a time.
+
+.L__vsa_top:
+# build the input _m128
+# first get x
+ mov p_xptr(%rsp),%rsi # get x_array pointer
+ movups (%rsi),%xmm0
+ prefetch 64(%rsi)
+
+
+ movaps %xmm0,%xmm2
+ andps .L__mask_nsign(%rip),%xmm0 # get abs x
+ andps .L__mask_sign(%rip),%xmm2 # mask for the sign bits
+ movaps %xmm0,p_ax(%rsp) # save them
+ movaps %xmm2,p_sx(%rsp) # save them
+# convert all four x's to double
+ cvtps2pd p_ax(%rsp),%xmm0
+ cvtps2pd p_ax+8(%rsp),%xmm1
+#
+# do x special case checking
+#
+# movdqa %xmm4,%xmm5
+# pcmpeqd %xmm3,%xmm5 ; is y not an integer? ff's if so
+# pand .LXMMWORD(%rip),%xmm5 PTR __mask_NaN ; these values will be NaNs, if x<0
+ pxor %xmm3,%xmm3
+ xor %eax,%eax
+ mov $0x07FC00000,%ecx
+ cmp $0,%ebx # is y not an integer?
+ cmovz %ecx,%eax # then set to return a NaN. else 0.
+ mov $0x080000000,%ecx
+ cmp $1,%ebx # is y an odd integer?
+ cmovz %ecx,%eax # maybe set sign bit if so
+ movd %eax,%xmm5
+ pshufd $0,%xmm5,%xmm5
+# shufps xmm5,%xmm5
+# movdqa %xmm4,%xmm2
+# pcmpeqd .LXMMWORD(%rip),%xmm2 PTR __mask_1 ; is it odd? ff's if so
+# pand .LXMMWORD(%rip),%xmm2 PTR __mask_sign ; these values might get their sign bit set
+# por %xmm2,%xmm5
+
+# cmpps xmm3,XMMWORD PTR p_sx[rsp],0 ; if the signs are set
+ pcmpeqd p_sx(%rsp),%xmm3 # if the signs are set
+ pandn %xmm5,%xmm3 # then negateres gets the values as shown below
+ movdqa %xmm3,p_negateres(%rsp) # save negateres
+
+# /* p_negateres now means the following.
+# 7FC00000 means x<0, y not an integer, return NaN.
+# 80000000 means x<0, y is odd integer, so set the sign bit.
+## 0 means even integer, and/or x>=0.
+# */
+
+# **** Here starts the main calculations ****
+# The algorithm used is x**y = exp(y*log(x))
+# Extra precision is required in intermediate steps to meet the 1ulp requirement
+#
+# log(x) calculation
+ call __vrd4_log@PLT # get the double precision log value
+ # for all four x's
+# y* logx
+ cvtps2pd p_y(%rsp),%xmm2 #convert the two packed single y's to double
+
+# /* just multiply by y */
+ mulpd %xmm2,%xmm0
+ mulpd %xmm2,%xmm1
+
+# /* The following code computes r = exp(w) */
+ call __vrd4_exp@PLT # get the double exp value
+ # for all four y*log(x)'s
+ mov p_xptr(%rsp),%rsi # get x_array pointer
+
+#
+# convert all four results back to single
+ cvtpd2ps %xmm0,%xmm0
+ cvtpd2ps %xmm1,%xmm1
+ movlhps %xmm1,%xmm0
+
+# perform special case and error checking on input values
+
+# special case checking is done first in the scalar version since
+# it allows for early fast returns. But for vectors, we consider them
+# to be rare, so early returns are not necessary. So we first compute
+# the x**y values, and then check for special cases.
+
+# we do some of the checking in reverse order of the scalar version.
+# apply the negate result flags
+ orps p_negateres(%rsp),%xmm0 # get negateres
+
+## if y is infinite or so large that the result would overflow or underflow
+ mov p_y(%rsp),%edx # get y
+ and $0x07fffffff,%edx # develop ay
+# mov $0x04f000000,%eax
+ cmp $0x04f000000,%edx
+ ja .Ly_large
+.Lrnsx3:
+
+## if x is infinite
+ movdqa p_ax(%rsp),%xmm4
+ cmpps $0,.L__mask_inf(%rip),%xmm4 # equal to infinity, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_infinite
+.Lrnsx1:
+## if x is zero
+ xorps %xmm4,%xmm4
+ cmpps $0,p_ax(%rsp),%xmm4 # equal to zero, ffs if so.
+ movmskps %xmm4,%edx
+ test $0x0f,%edx
+ jnz .Lx_zero
+.Lrnsx2:
+## if y is NAN
+ movss p_y(%rsp),%xmm4 # get y
+ ucomiss %xmm4,%xmm4 # comparing y to itself should
+ # be true, unless y is a NaN. parity flag if NaN.
+ jp .Ly_NaN
+.Lrnsx4:
+## if x is NAN
+ movdqa p_ax(%rsp),%xmm4 # get x
+ cmpps $4,%xmm4,%xmm4 # a compare not equal of x to itself should
+ # be false, unless x is a NaN. ff's if NaN.
+ movmskps %xmm4,%ecx
+ test $0x0f,%ecx
+ jnz .Lx_NaN
+.Lrnsx5:
+
+## if x == +1, return +1 for all x
+ movdqa .L__float_one(%rip),%xmm3 # one
+ mov p_xptr(%rsp),%rdx # get pointer to x
+ movdqa %xmm3,%xmm2
+ movdqu (%rdx), %xmm5
+ cmpps $4,%xmm5,%xmm2 # not equal to +1.0?, ffs if not equal.
+ andps %xmm2,%xmm0 # keep the others
+ andnps %xmm3,%xmm2 # mask for ones
+ orps %xmm2,%xmm0
+
+.L__vsa_bottom:
+
+# update the x and y pointers
+ add $16,%rsi
+ mov %rsi,p_xptr(%rsp) # save x_array pointer
+# store the result _m128d
+ mov p_zptr(%rsp),%rdi # get z_array pointer
+ movups %xmm0,(%rdi)
+# prefetchw QWORD PTR [rdi+64]
+ prefetch 64(%rdi)
+ add $16,%rdi
+ mov %rdi,p_zptr(%rsp) # save z_array pointer
+
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vsa_top
+
+
+# see if we need to do any extras
+ mov p_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vsa_cleanup
+
+.L__final_check:
+
+ mov save_rbx(%rsp),%rbx # restore rbx
+ add $stack_size,%rsp
+ ret
+
+ .align 16
+# we jump here when there are one to three leftover values to process at the
+# end
+.L__vsa_cleanup:
+ mov p_nv(%rsp),%rax # get number of values
+
+ mov p_xptr(%rsp),%rsi
+ mov p_y(%rsp),%r8d # r8 is uy
+
+# fill in an m128 with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movaps %xmm0,p2_temp(%rsp)
+ movaps %xmm0,p2_temp+16(%rsp)
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p2_temp(%rsp)
+ mov %r8d,p2_temp+16(%rsp)
+ cmp $2,%rax
+ jl .L__vsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p2_temp+4(%rsp)
+ mov %r8d,p2_temp+20(%rsp)
+ cmp $3,%rax
+ jl .L__vsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p2_temp+8(%rsp)
+ mov %r8d,p2_temp+24(%rsp)
+
+.L__vsacg:
+ mov $4,%rdi # parameter for N
+ lea p2_temp(%rsp),%rsi # &x parameter
+ movaps p2_temp+16(%rsp),%xmm0 # y parameter
+ lea p2_temp1(%rsp),%rdx # &z parameter
+ call vrsa_powxf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov p_zptr(%rsp),%rdi
+ mov p_nv(%rsp),%rax # get number of values
+ mov p2_temp1(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vsacgf
+
+ mov p2_temp1+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+
+.L__vsacgf:
+ jmp .L__final_check
+
+
+ .align 16
+.Ly_zero:
+## if |y| == 0 then return 1
+ mov $0x03f800000,%ecx # one
+# fill all results with a one
+ mov p_zptr(%rsp),%r9 # &z parameter
+ mov p_nv(%rsp),%rax # get number of values
+.L__yzt:
+ mov %ecx,(%r9) # store a 1
+ add $4,%r9
+ sub $1,%rax
+ test %rax,%rax
+ jnz .L__yzt
+ jmp .L__final_check
+# y is a NaN.
+.Ly_NaN:
+ mov p_y(%rsp),%r8d
+ or $0x000400000,%r8d # convert to QNaNs
+ movd %r8d,%xmm0 # propagate to all results
+ shufps $0,%xmm0,%xmm0
+ jmp .Lrnsx4
+
+# x is a NaN.
+.Lx_NaN:
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ movdqu (%rcx),%xmm4 # get x
+ movdqa %xmm4,%xmm3
+ movdqa %xmm4,%xmm5
+ movdqa .L__mask_sigbit(%rip),%xmm2 # get the signalling bits
+ cmpps $0,%xmm4,%xmm4 # a compare equal of x to itself should
+ # be true, unless x is a NaN. 0's if NaN.
+ cmpps $4,%xmm3,%xmm3 # compare not equal, ff's if NaN.
+ andps %xmm4,%xmm0 # keep the other results
+ andps %xmm3,%xmm2 # get just the right signalling bits
+ andps %xmm5,%xmm3 # mask for the NaNs
+ orps %xmm2,%xmm3 # convert to QNaNs
+ orps %xmm3,%xmm0 # combine
+ jmp .Lrnsx5
+
+# y is infinite or so large that the result would
+# overflow or underflow.
+.Ly_large:
+ movdqa %xmm0,p_temp(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov (%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 4(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 8(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov 12(%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special6 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx3
+
+# a subroutine to treat an individual x,y pair when y is large or infinity
+# assumes x in eax, y in ebx.
+# returns result in eax
+.Lnp_special6:
+# handle |x|==1 cases first
+ mov $0x07FFFFFFF,%r8d
+ and %eax,%r8d
+ cmp $0x03f800000,%r8d # jump if |x| !=1
+ jnz .Lnps6
+ mov $0x03f800000,%eax # return 1 for all |x|==1
+ jmp .Lnpx64
+
+# cases where |x| !=1
+.Lnps6:
+ mov $0x07f800000,%ecx
+ xor %eax,%eax # assume 0 return
+ test $0x080000000,%ebx
+ jnz .Lnps62 # jump if y negative
+# y = +inf
+ cmp $0x03f800000,%r8d
+ cmovg %ecx,%eax # return inf if |x| > 1
+ jmp .Lnpx64
+.Lnps62:
+# y = -inf
+ cmp $0x03f800000,%r8d
+ cmovl %ecx,%eax # return inf if |x| < 1
+ jmp .Lnpx64
+
+.Lnpx64:
+ ret
+
+# handle cases where x is +/- infinity. edx is the mask
+ .align 16
+.Lx_infinite:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxinfa
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov (%rcx),%eax
+ mov p_y(%rsp),%ebx
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxinfa:
+ test $2,%edx
+ jz .Lxinfb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxinfb:
+ test $4,%edx
+ jz .Lxinfc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxinfc:
+ test $8,%edx
+ jz .Lxinfd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x1 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxinfd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx1
+
+# a subroutine to treat an individual x,y pair when x is +/-infinity
+# assumes x in eax, y in ebx, inty in ecx.
+# returns result in eax
+.Lnp_special_x1: # x is infinite
+ test $0x080000000,%eax # is x positive
+ jnz .Lnsx11 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ xor %eax,%eax # else return 0
+ jmp .Lnsx13
+
+.Lnsx11:
+ cmp $1,%ecx # if inty ==1
+ jnz .Lnsx12 # jump if not
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return if so
+ mov $0x080000000,%eax # else return -0
+ jmp .Lnsx13
+.Lnsx12: # inty <>1
+ and $0x07FFFFFFF,%eax # eax = |x| (+inf), returned if y >= 0
+ test $0x080000000,%ebx # is y positive
+ jz .Lnsx13 # just return |x| if so
+ xor %eax,%eax # return 0 if y < 0
+.Lnsx13:
+ ret
+
+
+# handle cases where x is +/- zero. edx is the mask of x,y pairs with |x|=0
+ .align 16
+.Lx_zero:
+ movdqa %xmm0,p_temp(%rsp)
+
+ test $1,%edx
+ jz .Lxzera
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov (%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp(%rsp)
+.Lxzera:
+ test $2,%edx
+ jz .Lxzerb
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 4(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+4(%rsp)
+.Lxzerb:
+ test $4,%edx
+ jz .Lxzerc
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 8(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+8(%rsp)
+.Lxzerc:
+ test $8,%edx
+ jz .Lxzerd
+ mov p_xptr(%rsp),%rcx # get pointer to x
+ mov p_y(%rsp),%ebx
+ mov 12(%rcx),%eax
+ mov p_inty(%rsp),%ecx
+ sub $8,%rsp
+ call .Lnp_special_x2 # call the handler for one value
+ add $8,%rsp
+ mov %eax,p_temp+12(%rsp)
+.Lxzerd:
+ movdqa p_temp(%rsp),%xmm0
+ jmp .Lrnsx2
+
+# a subroutine to treat an individual x,y pair when x is +/-0
+# assumes x in eax, y in ebx, inty in ecx.
+# returns result in eax
+ .align 16
+.Lnp_special_x2:
+ cmp $1,%ecx # if inty ==1
+ jz .Lnsx21 # jump if so
+# handle cases of x=+/-0, y not integer
+ xor %eax,%eax
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%eax
+ jmp .Lnsx23
+# y is an integer
+.Lnsx21:
+ xor %r8d,%r8d
+ mov $0x07f800000,%ecx
+ test $0x080000000,%ebx # is ypos
+ cmovnz %ecx,%r8d # set to infinity if not
+ and $0x080000000,%eax # pickup the sign of x
+ or %r8d,%eax # and include it in the result
+.Lnsx23:
+ ret
+
+ .data
+ .align 64
+
+.L__mask_sign: .quad 0x08000000080000000 # a sign bit mask
+ .quad 0x08000000080000000
+
+.L__mask_nsign: .quad 0x07FFFFFFF7FFFFFFF # a not sign bit mask
+ .quad 0x07FFFFFFF7FFFFFFF
+
+# used by inty
+.L__mask_127: .quad 0x00000007F0000007F # EXPBIAS_SP32
+ .quad 0x00000007F0000007F
+
+.L__mask_mant: .quad 0x0007FFFFF007FFFFF # mantissa bit mask
+ .quad 0x0007FFFFF007FFFFF
+
+.L__mask_1: .quad 0x00000000100000001 # 1
+ .quad 0x00000000100000001
+
+.L__mask_2: .quad 0x00000000200000002 # 2
+ .quad 0x00000000200000002
+
+.L__mask_24: .quad 0x00000001800000018 # 24
+ .quad 0x00000001800000018
+
+.L__mask_23: .quad 0x00000001700000017 # 23
+ .quad 0x00000001700000017
+
+# used by special case checking
+
+.L__float_one: .quad 0x03f8000003f800000 # one
+ .quad 0x03f8000003f800000
+
+.L__mask_inf: .quad 0x07f8000007F800000 # infinity
+ .quad 0x07f8000007F800000
+
+.L__mask_ninf: .quad 0x0ff800000fF800000 # -infinity
+ .quad 0x0ff800000fF800000
+
+.L__mask_NaN: .quad 0x07fC000007FC00000 # NaN
+ .quad 0x07fC000007FC00000
+
+.L__mask_sigbit: .quad 0x00040000000400000 # QNaN bit
+ .quad 0x00040000000400000
+
+.L__mask_impbit: .quad 0x00080000000800000 # implicit bit
+ .quad 0x00080000000800000
+
+.L__mask_ly: .quad 0x04f0000004f000000 # large y
+ .quad 0x04f0000004f000000
+
+
+
diff --git a/src/gas/vrsasincosf.S b/src/gas/vrsasincosf.S
new file mode 100644
index 0000000..2bb70bf
--- /dev/null
+++ b/src/gas/vrsasincosf.S
@@ -0,0 +1,2008 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsasincosf.s
+#
+# A vector implementation of the sincos libm function.
+#
+# Prototype:
+#
+# void vrsa_sincosf(int n, float *x, float *ys, float *yc);
+#
+# Computes Sine and Cosine of x for an array of input values.
+# Places the Sine results into the supplied ys array and the Cosine results into the supplied yc array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This routine computes 4 single precision Sine Cosine values at a time,
+# reading the inputs from the supplied x array.
+# The four Sine results are returned as packed singles in the supplied ys array.
+# The four Cosine results are returned as packed singles in the supplied yc array.
+# Note that returning both a sine and a cosine result is a non-standard interface,
+# as no ABI (and indeed C) currently allows returning 2 values from a function;
+# the array form therefore places its inputs and results in memory. It is expected
+# that some compilers may be able to take advantage of this interface when
+# implementing vectorized loops.
+
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
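+
+# A minimal C usage sketch (illustrative only; the array contents are hypothetical):
+#
+#   float x[8], s[8], c[8];
+#   /* ... fill x ... */
+#   vrsa_sincosf(8, x, s, c);    /* s[i] = sinf(x[i]), c[i] = cosf(x[i]) */
+#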
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+.align 8
+ .Levensin_oddcos_tbl:
+
+ .quad .Lsinsin_sinsin_piby4 # 0 * ; Done
+ .quad .Lsinsin_sincos_piby4 # 1 + ; Done
+ .quad .Lsinsin_cossin_piby4 # 2 ; Done
+ .quad .Lsinsin_coscos_piby4 # 3 + ; Done
+
+ .quad .Lsincos_sinsin_piby4 # 4 ; Done
+ .quad .Lsincos_sincos_piby4 # 5 * ; Done
+ .quad .Lsincos_cossin_piby4 # 6 ; Done
+ .quad .Lsincos_coscos_piby4 # 7 ; Done
+
+ .quad .Lcossin_sinsin_piby4 # 8 ; Done
+ .quad .Lcossin_sincos_piby4 # 9 ; TBD
+ .quad .Lcossin_cossin_piby4 # 10 * ; Done
+ .quad .Lcossin_coscos_piby4 # 11 ; Done
+
+ .quad .Lcoscos_sinsin_piby4 # 12 ; Done
+ .quad .Lcoscos_sincos_piby4 # 13 + ; Done
+ .quad .Lcoscos_cossin_piby4 # 14 ; Done
+ .quad .Lcoscos_coscos_piby4 # 15 * ; Done
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrsa_sincosf_
+ .set vrsa_sincosf_,__vrsa_sincosf__
+ .weak vrsa_sincosf__
+ .set vrsa_sincosf__,__vrsa_sincosf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#FORTRAN subroutine implementation of array sincos
+#VRSA_SINCOSF(N,X,Y,Z)
+#C equivalent
+#void vrsa_sincosf__(int * n, float *x, float *y, float *z)
+#{
+# vrsa_sincosf(*n,x,y,z);
+#}
+
+.globl __vrsa_sincosf__
+ .type __vrsa_sincosf__,@function
+__vrsa_sincosf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # temporary for get/put bits operation
+.equ save_xmm7,0x30 # temporary for get/put bits operation
+.equ save_xmm8,0x40 # temporary for get/put bits operation
+.equ save_xmm9,0x50 # temporary for get/put bits operation
+.equ save_xmm0,0x60 # temporary for get/put bits operation
+.equ save_xmm11,0x70 # temporary for get/put bits operation
+.equ save_xmm12,0x80 # temporary for get/put bits operation
+.equ save_xmm13,0x90 # temporary for get/put bits operation
+.equ save_xmm14,0x0A0 # temporary for get/put bits operation
+.equ save_xmm15,0x0B0 # temporary for get/put bits operation
+
+.equ r,0x0C0 # pointer to r for remainder_piby2
+.equ rr,0x0D0 # pointer to rr for remainder_piby2
+.equ region,0x0E0 # pointer to region for remainder_piby2
+
+.equ r1,0x0F0 # pointer to r1 for remainder_piby2
+.equ rr1,0x0100 # pointer to rr1 for remainder_piby2
+.equ region1,0x0110 # pointer to region1 for remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # mask
+.equ p_sign_sin,0x0180 # Sign of lower sin term
+
+.equ p_original1,0x0190 # original x (upper pair)
+.equ p_mask1,0x01A0 # mask (upper pair)
+.equ p_sign1_sin,0x01B0 # Sign of upper sin term
+
+
+.equ save_r12,0x01C0 # save area for r12
+.equ save_r13,0x01D0 # save area for r13
+
+.equ p_sin,0x01E0 # sin
+.equ p_cos,0x01F0 # cos
+
+.equ save_rdi,0x0200 # save area for rdi
+.equ save_rsi,0x0210 # save area for rsi
+
+.equ p_sign_cos,0x0220 # Sign of lower cos term
+.equ p_sign1_cos,0x0230 # Sign of upper cos term
+
+.equ save_xa,0x0240 #qword ; leave space for 4 args*****
+.equ save_ysa,0x0250 #qword ; leave space for 4 args*****
+.equ save_yca,0x0260 #qword ; leave space for 4 args*****
+
+.equ save_nv,0x0270 #qword
+.equ p_iter,0x0280 #qword storage for number of loop iterations
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.globl vrsa_sincosf
+ .type vrsa_sincosf,@function
+vrsa_sincosf:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in by Linux as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *ys
+# rcx - float *yc
+
+ sub $0x0298,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ysa(%rsp) # save ysin_array pointer
+ mov %rcx,save_yca(%rsp) # save ycos_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrsa_top:
+# build the input _m128d
+# movapd .L__real_7fffffffffffffff,%xmm2 #
+# mov .L__real_7fffffffffffffff,%rdx #
+
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlps (%rsi),%xmm0
+ movhps 8(%rsi),%xmm0
+
+ prefetch 32(%rsi)
+ add $16,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#STARTMAIN
+
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+mov %rdi, p_sin(%rsp) # save address for sin return
+mov %rsi, p_cos(%rsp) # save address for cos return
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
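+#
+# The full reduction, in scalar C terms (a sketch matching the comments below):
+#   npi2  = (int)(x * twobypi + 0.5);
+#   rhead = x - npi2 * piby2_1;
+#   rtail = npi2 * piby2_2;
+#   t     = rhead;  rhead = t - rtail;
+#   rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+#   r     = rhead - rtail;        /* extra-precision remainder of x mod pi/2 */
+#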
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+#DELETE
+# mov .LQWORD,%rdx PTR __reald_one_zero ;compare value for cossin path
+#DELETE
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail, r8 = region, r10 = region, r12 = Sign
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail, r9 = region, r11 = region, r13 = Sign
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+# NEW
+
+ #ADDED
+ mov %r10,%rdi # npi2 in int
+ mov %r11,%rsi # npi2 in int
+ #ADDED
+
+ shr $1,%r10 # 0 and 1 => 0
+ shr $1,%r11 # 2 and 3 => 1
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ #ADDED
+ xor %r10,%rdi # xor last 2 bits of region for cos
+ xor %r11,%rsi # xor last 2 bits of region for cos
+ #ADDED
+
+ not %r12 #~(sign)
+ not %r13 #~(sign)
+ and %r12,%r10 #region & ~(sign)
+ and %r13,%r11 #region & ~(sign)
+
+ not %rax #~(region)
+ not %rcx #~(region)
+ not %r12 #~~(sign)
+ not %r13 #~~(sign)
+ and %r12,%rax #~region & ~~(sign)
+ and %r13,%rcx #~region & ~~(sign)
+
+ #ADDED
+ and .L__reald_one_one(%rip),%rdi # sign for cos
+ and .L__reald_one_one(%rip),%rsi # sign for cos
+ #ADDED
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 # sign for sin
+ and .L__reald_one_one(%rip),%r11 # sign for sin
+
+
+
+
+
+
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ #ADDED
+ mov %rdi,%rax
+ mov %rsi,%rcx
+ #ADDED
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ #ADDED
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+ #ADDED
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ #ADDED
+ shl $63,%rdi #shift lower sign bit left by 63 bits
+ shl $63,%rsi #shift lower sign bit left by 63 bits
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+ #ADDED
+
+ mov %r10,p_sign_sin(%rsp) #write out lower sign bit
+ mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1_sin(%rsp) #write out lower sign bit
+ mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit
+
+ mov %rdi,p_sign_cos(%rsp) #write out lower sign bit
+ mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit
+ mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit
+ mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit
+
+# NEW
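+
+# The "NEW"/"ADDED" block above derives per-lane sign masks for the final sin
+# and cos results. Roughly, in C-like pseudocode (illustrative names only):
+#
+#   sin_sign = ((npi2 >> 1) ^ signbit(x)) & 1;   /* sin is odd in x         */
+#   cos_sign = ((npi2 >> 1) ^  npi2)      & 1;   /* depends on region only  */
+#
+# Each bit is then shifted into the sign-bit position of its double lane and
+# stored in p_sign_sin/p_sign_cos (and the p_sign1_* copies for the second
+# pair of inputs), to be XORed onto the polynomial results in the cleanup.
+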
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+# subpd %xmm10,%xmm6 ;rr=rhead-r
+# subpd %xmm1,%xmm7 ;rr=rhead-r
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+# subpd xmm6, xmm8 ;rr=(rhead-r) -rtail
+# subpd xmm7, xmm9 ;rr=(rhead-r) -rtail
+
+ and .L__reald_zero_one(%rip),%rax # region for jump table
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
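+
+# The instructions above pack the odd/even flags (bit 0 of npi2) of the four
+# lanes into a 4-bit index used below to pick one of the 16 fixup stubs in
+# .Levensin_oddcos_tbl. Roughly:
+#
+#   idx = (npi2_0 & 1) | ((npi2_1 & 1) << 1)
+#       | ((npi2_2 & 1) << 2) | ((npi2_3 & 1) << 3);
+#
+# A set bit means that lane landed in an odd region, so its sin result must
+# be taken from the cos polynomial evaluated below (and vice versa).
+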
+
+
+# HARSHA ADDED
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_sin = Sign, p_sign_cos = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_sin = Sign, p_sign1_cos = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm14 # for x3
+ movapd %xmm3,%xmm15 # for x3
+
+ movapd %xmm2,%xmm0 # for r
+ movapd %xmm3,%xmm11 # for r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm6 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm7 # c4
+
+ movapd .Lsinarray+0x10(%rip),%xmm12 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm13 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm14 # x3
+ mulpd %xmm1,%xmm15 # x3
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm6 # c2*x2
+ mulpd %xmm3,%xmm7 # c2*x2
+
+ mulpd %xmm2,%xmm12 # c4*x2
+ mulpd %xmm3,%xmm13 # c4*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4
+
+ addpd .Lsinarray(%rip),%xmm12 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm13 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ mulpd %xmm2,%xmm6 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm7 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ addpd %xmm12,%xmm6 # zs
+ addpd %xmm13,%xmm7 # zs
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ mulpd %xmm14,%xmm6 # x3 * zs
+ mulpd %xmm15,%xmm7 # x3 * zs
+
+ subpd %xmm0,%xmm4 # - (-t)
+ subpd %xmm11,%xmm5 # - (-t)
+
+ addpd %xmm10,%xmm6 # +x
+ addpd %xmm1,%xmm7 # +x
+
+# HARSHA ADDED
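+
+# A C-like sketch of the core polynomial evaluation above, per double lane,
+# with c1..c4 from .Lcosarray and s1..s4 from .Lsinarray (names illustrative):
+#
+#   r2 = r * r;   r3 = r2 * r;   r4 = r2 * r2;
+#   zc = (c1 + c2*r2) + r4*(c3 + c4*r2);
+#   zs = (s1 + s2*r2) + r4*(s3 + s4*r2);
+#   cos(r) ~= 1.0 - 0.5*r2 + r4*zc;
+#   sin(r) ~= r + r3*zs;
+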
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm8,,%xmm5 xmm0, xmm12
+# %xmm11,,%xmm9 xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0					# xmm0 = rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ subsd %xmm10,%xmm6 # rr=rhead-r
+ subsd %xmm0,%xmm6 # xmm6 = rr=((rhead-r) -rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+ movlpd %xmm6,rr+8(%rsp) # store upper rr
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
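+
+# Arguments >= 5e5 are reduced by the out-of-line __remainder_piby2d2f helper;
+# NaN/Inf inputs are filtered first so the call can be skipped. Per scalar
+# argument the logic above is roughly (C-like sketch):
+#
+#   if ((bits(x) & 0x7ff0000000000000) == 0x7ff0000000000000) {  /* NaN/Inf */
+#       r      = bits(x) | 0x0008000000000000;  /* quiet the NaN / Inf->NaN */
+#       region = 0;
+#   } else {
+#       /* argument bits in %rdi, &r in %rsi, &region in %rdx */
+#       __remainder_piby2d2f(...);
+#   }
+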
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Restore xmm4 and %xmm3,,%xmm1 xmm7
+# Can use %xmm0,,%xmm8 xmm12
+# %xmm9,,%xmm5 xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax				# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm0	# xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 = piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0					# xmm0 = rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	%eax,region(%rsp)				# store lower region
+
+# movsd %xmm6,%xmm10
+# subsd xmm10,xmm0 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm10,%xmm6 ; rr=rhead-r
+# subsd xmm6, xmm0 ; xmm6 = rr=((rhead-r) -rtail)
+
+	subsd	%xmm0,%xmm6					# xmm6 = r=(rhead-rtail)
+
+# movlpd QWORD PTR r[rsp], xmm10 ; store upper r
+# movlpd QWORD PTR rr[rsp], xmm6 ; store upper rr
+
+	movlpd	%xmm6,r(%rsp)				# store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm1,%xmm7 ; rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+# subpd %xmm1,%xmm7 ; rr=rhead-r
+# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr1[rsp], xmm7
+
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Can use %xmm11,,%xmm9 xmm13
+# %xmm8,,%xmm5 xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+
+	subsd	%xmm10,%xmm7				# xmm7 = r=(rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+
+ jmp 0f
+
+.L__vrs4_sincosf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi #Restore upper fp arg for remainder_piby2 call
+
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+
+ jmp .L__vrs4_sincosf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rcx,,%rax r8, r9
+#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+# movsd %xmm7,%xmm1
+# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm1,%xmm7 ; rr=rhead-r
+# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail)
+
+	subsd	%xmm10,%xmm7				# xmm7 = r=(rhead-rtail)
+
+# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r
+# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr
+
+	movlpd	%xmm7,r1(%rsp)				# store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sincosf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi #Restore upper fp arg for remainder_piby2 call
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sincosf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sincosf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sincosf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_sin = Sign, ; p_sign_cos = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_sin = Sign, ; p_sign1_cos = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+
+# NEW
+
+ #ADDED
+ mov %r10,%rdi
+ mov %r11,%rsi
+ #ADDED
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ #ADDED
+ xor %r10,%rdi
+ xor %r11,%rsi
+ #ADDED
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ #ADDED
+ and .L__reald_one_one(%rip),%rdi #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%rsi #(~AB+A~B)&1
+ #ADDED
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+
+
+
+
+
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ #ADDED
+ mov %rdi,%rax
+ mov %rsi,%rcx
+ #ADDED
+
+ and .L__reald_one_zero(%rip),%r12 #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ #ADDED
+ and .L__reald_one_zero(%rip),%rax #mask out the lower sign bit leaving the upper sign bit
+ and .L__reald_one_zero(%rip),%rcx #mask out the lower sign bit leaving the upper sign bit
+ #ADDED
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ #ADDED
+ shl $63,%rdi #shift lower sign bit left by 63 bits
+ shl $63,%rsi #shift lower sign bit left by 63 bits
+ shl $31,%rax #shift upper sign bit left by 31 bits
+ shl $31,%rcx #shift upper sign bit left by 31 bits
+ #ADDED
+
+ mov %r10,p_sign_sin(%rsp) #write out lower sign bit
+ mov %r12,p_sign_sin+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1_sin(%rsp) #write out lower sign bit
+ mov %r13,p_sign1_sin+8(%rsp) #write out upper sign bit
+
+ mov %rdi,p_sign_cos(%rsp) #write out lower sign bit
+ mov %rax,p_sign_cos+8(%rsp) #write out upper sign bit
+ mov %rsi,p_sign1_cos(%rsp) #write out lower sign bit
+ mov %rcx,p_sign1_cos+8(%rsp) #write out upper sign bit
+#NEW
+
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+
+# HARSHA ADDED
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign_cos = Sign, p_sign_sin = Sign, xmm10 = r, xmm2 = r2
+# p_sign1_cos = Sign, p_sign1_sin = Sign, xmm1 = r, xmm3 = r2
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm14 # for x3
+ movapd %xmm3,%xmm15 # for x3
+
+ movapd %xmm2,%xmm0 # for r
+ movapd %xmm3,%xmm11 # for r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm6 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm7 # c4
+
+ movapd .Lsinarray+0x10(%rip),%xmm12 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm13 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm14 # x3
+ mulpd %xmm1,%xmm15 # x3
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm6 # c2*x2
+ mulpd %xmm3,%xmm7 # c2*x2
+
+ mulpd %xmm2,%xmm12 # c4*x2
+ mulpd %xmm3,%xmm13 # c4*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ addpd .Lsinarray+0x20(%rip),%xmm6 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm7 # c3+x2c4
+
+ addpd .Lsinarray(%rip),%xmm12 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm13 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ mulpd %xmm2,%xmm6 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm7 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ addpd %xmm12,%xmm6 # zs
+ addpd %xmm13,%xmm7 # zs
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ mulpd %xmm14,%xmm6 # x3 * zs
+ mulpd %xmm15,%xmm7 # x3 * zs
+
+ subpd %xmm0,%xmm4 # - (-t)
+ subpd %xmm11,%xmm5 # - (-t)
+
+ addpd %xmm10,%xmm6 # +x
+ addpd %xmm1,%xmm7 # +x
+
+# HARSHA ADDED
+
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrsa_sincosf_cleanup:
+
+ movapd p_sign_cos(%rsp),%xmm10
+ movapd p_sign1_cos(%rsp),%xmm1
+ xorpd %xmm4,%xmm10 # Cos term (+) Sign
+ xorpd %xmm5,%xmm1 # Cos term (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+
+ movapd p_sign_sin(%rsp),%xmm14
+ movapd p_sign1_sin(%rsp),%xmm15
+ xorpd %xmm6,%xmm14 # Sin term (+) Sign
+ xorpd %xmm7,%xmm15 # Sin term (+) Sign
+
+ cvtpd2ps %xmm14,%xmm12
+ cvtpd2ps %xmm15,%xmm13
+
+
+.L__vrsa_bottom1:
+# store the result _m128 (packed singles)
+
+ mov save_ysa(%rsp),%r8
+ mov save_yca(%rsp),%r9
+
+ movlps %xmm0, (%r9) # save the cos
+ movlps %xmm12, (%r8) # save the sin
+ movlps %xmm11, 8(%r9) # save the cos
+ movlps %xmm13, 8(%r8) # save the sin
+
+
+ prefetch 32(%r8)
+ prefetch 32(%r9)
+
+ add $16,%r8
+ add $16,%r9
+
+ mov %r8,save_ysa(%rsp) # save y_sinarray pointer
+ mov %r9,save_yca(%rsp) # save y_cosarray pointer
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrsa_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrsa_cleanup
+
+.L__final_check:
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x0298,%rsp
+ ret
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when fewer than four values remain to be processed at the end.
+# The x, sin-result and cos-result pointers are reloaded below from save_xa,
+# save_ysa and save_yca; the number of values left is in save_nv.
+
+.align 16
+.L__vrsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ysa(%rsp),%rdi
+ mov save_yca(%rsp),%r12
+
+
+# fill an _m128 with zeroes and the leftover values, then make a recursive call.
+ xorps %xmm0,%xmm0
+ movss %xmm0,p_temp+4(%rsp)
+ movlps %xmm0,p_temp+8(%rsp)
+
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vrsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p_temp+8(%rsp)
+
+.L__vrsacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &ys parameter
+ lea p_temp3(%rsp),%rcx # &yc parameter
+ call vrsa_sincosf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ysa(%rsp),%rdi
+ mov save_yca(%rsp),%r12
+ mov save_nv(%rsp),%rax # get number of values
+
+ mov p_temp2(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ mov p_temp3(%rsp),%edx
+ mov %edx,(%r12) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ mov p_temp3+4(%rsp),%edx
+ mov %edx,4(%r12) # do the second value
+ cmp $3,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+ mov p_temp3+8(%rsp),%edx
+ mov %edx,8(%r12) # do the third value
+
+.L__vrsacgf:
+ jmp .L__final_check
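+
+# The cleanup path above handles the 1-3 leftover elements when n is not a
+# multiple of 4: it zero-pads a 4-element scratch buffer, calls vrsa_sincosf
+# recursively on it, and copies back only the valid results. Roughly:
+#
+#   float tx[4] = {0}, ts[4], tc[4];
+#   for (i = 0; i < nleft; i++) tx[i] = x[i];
+#   vrsa_sincosf(4, tx, ts, tc);
+#   for (i = 0; i < nleft; i++) { ys[i] = ts[i]; yc[i] = tc[i]; }
+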
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+.align 16
+.Lcoscos_coscos_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower and Upper odd, so swap both pairs
+
+ movapd %xmm4,%xmm8
+ movapd %xmm5,%xmm9
+
+ movapd %xmm6,%xmm4
+ movapd %xmm7,%xmm5
+
+ movapd %xmm8,%xmm6
+ movapd %xmm9,%xmm7
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcossin_cossin_piby4:
+
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsincos_cossin_piby4:
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcoscos_sinsin_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower even, Upper odd, Swap upper
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+# Cos in %xmm5,%xmm4
+# Sin in %xmm7,%xmm6
+# Lower odd, Upper even, Swap lower
+
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcoscos_cossin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcoscos_sincos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm5,%xmm9
+ movapd %xmm7,%xmm5
+ movapd %xmm9,%xmm7
+
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm12
+
+ movsd %xmm8,%xmm6
+ movsd %xmm12,%xmm4
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcossin_coscos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm5,%xmm9
+ movhlps %xmm7,%xmm13
+
+ movlhps %xmm9,%xmm7
+ movlhps %xmm13,%xmm5
+
+ jmp .L__vrsa_sincosf_cleanup
+
+
+.align 16
+.Lsincos_coscos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movapd %xmm4,%xmm8
+ movapd %xmm6,%xmm4
+ movapd %xmm8,%xmm6
+
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm13
+
+ movsd %xmm9,%xmm7
+ movsd %xmm13,%xmm5
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm5,%xmm9
+ movsd %xmm7,%xmm5
+ movsd %xmm9,%xmm7
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movhlps %xmm4,%xmm8
+ movhlps %xmm6,%xmm12
+
+ movlhps %xmm8,%xmm6
+ movlhps %xmm12,%xmm4
+
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+ movsd %xmm4,%xmm8
+ movsd %xmm6,%xmm4
+ movsd %xmm8,%xmm6
+ jmp .L__vrsa_sincosf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+# Cos in xmm4 and xmm5
+# Sin in xmm6 and xmm7
+# Lower and Upper even, nothing to swap
+
+ jmp .L__vrsa_sincosf_cleanup
diff --git a/src/gas/vrsasinf.S b/src/gas/vrsasinf.S
new file mode 100644
index 0000000..6cbff59
--- /dev/null
+++ b/src/gas/vrsasinf.S
@@ -0,0 +1,2441 @@
+
+#
+# (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+#
+# This file is part of libacml_mv.
+#
+# libacml_mv is free software; you can redistribute it and/or
+# modify it under the terms of the GNU Lesser General Public
+# License as published by the Free Software Foundation; either
+# version 2.1 of the License, or (at your option) any later version.
+#
+# libacml_mv is distributed in the hope that it will be useful,
+# but WITHOUT ANY WARRANTY; without even the implied warranty of
+# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+# Lesser General Public License for more details.
+#
+# You should have received a copy of the GNU Lesser General Public
+# License along with libacml_mv. If not, see
+# <http://www.gnu.org/licenses/>.
+#
+#
+
+
+
+
+
+#
+# vrsasinf.s
+#
+# A vector implementation of the sin libm function.
+#
+# Prototype:
+#
+# vrsa_sinf(int n, float* x, float* y);
+#
+# Computes Sine of x for an array of input values.
+# Places the results into the supplied y array.
+# Does not perform error checking.
+# Denormal inputs may produce unexpected results.
+# This inlines a routine that computes 4 single precision Sine values at a time.
+# The four values are passed as packed single in xmm10.
+# The four results are returned as packed singles in xmm10.
+# Note that this represents a non-standard ABI usage, as no ABI
+# (and indeed C) currently allows returning 2 values for a function.
+# It is expected that some compilers may be able to take advantage of this
+# interface when implementing vectorized loops. Using the array implementation
+# of the routine requires putting the inputs into memory, and retrieving
+# the results from memory. This routine eliminates the need for this
+# overhead if the data does not already reside in memory.
+# Author: Harsha Jagasia
+# Email: harsha.jagasia@amd.com
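+#
+# A minimal C usage sketch (declaration taken from the prototype above):
+#
+#   extern void vrsa_sinf(int n, float *x, float *y);
+#
+#   float x[8] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f};
+#   float y[8];
+#   vrsa_sinf(8, x, y);    /* y[i] = sinf(x[i]) */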
+
+#ifdef __ELF__
+.section .note.GNU-stack,"",@progbits
+#endif
+
+.data
+.align 64
+.L__real_7fffffffffffffff: .quad 0x07fffffffffffffff #Sign bit zero
+ .quad 0x07fffffffffffffff
+.L__real_3ff0000000000000: .quad 0x03ff0000000000000 # 1.0
+ .quad 0x03ff0000000000000
+.L__real_v2p__27: .quad 0x03e40000000000000 # 2p-27
+ .quad 0x03e40000000000000
+.L__real_3fe0000000000000: .quad 0x03fe0000000000000 # 0.5
+ .quad 0x03fe0000000000000
+.L__real_3fc5555555555555: .quad 0x03fc5555555555555 # 0.166666666666
+ .quad 0x03fc5555555555555
+.L__real_3fe45f306dc9c883: .quad 0x03fe45f306dc9c883 # twobypi
+ .quad 0x03fe45f306dc9c883
+.L__real_3ff921fb54400000: .quad 0x03ff921fb54400000 # piby2_1
+ .quad 0x03ff921fb54400000
+.L__real_3dd0b4611a626331: .quad 0x03dd0b4611a626331 # piby2_1tail
+ .quad 0x03dd0b4611a626331
+.L__real_3dd0b4611a600000: .quad 0x03dd0b4611a600000 # piby2_2
+ .quad 0x03dd0b4611a600000
+.L__real_3ba3198a2e037073: .quad 0x03ba3198a2e037073 # piby2_2tail
+ .quad 0x03ba3198a2e037073
+.L__real_fffffffff8000000: .quad 0x0fffffffff8000000 # mask for stripping head and tail
+ .quad 0x0fffffffff8000000
+.L__real_8000000000000000: .quad 0x08000000000000000 # -0 or signbit
+ .quad 0x08000000000000000
+.L__reald_one_one: .quad 0x00000000100000001 #
+ .quad 0
+.L__reald_two_two: .quad 0x00000000200000002 #
+ .quad 0
+.L__reald_one_zero: .quad 0x00000000100000000 # sin_cos_filter
+ .quad 0
+.L__reald_zero_one: .quad 0x00000000000000001 #
+ .quad 0
+.L__reald_two_zero: .quad 0x00000000200000000 #
+ .quad 0
+.L__realq_one_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+.L__realq_two_two: .quad 0x00000000000000002 #
+ .quad 0x00000000000000002 #
+.L__real_1_x_mask: .quad 0x0ffffffffffffffff #
+ .quad 0x03ff0000000000000 #
+.L__real_zero: .quad 0x00000000000000000 #
+ .quad 0x00000000000000000 #
+.L__real_one: .quad 0x00000000000000001 #
+ .quad 0x00000000000000001 #
+
+.Lcosarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03FA5555555502F31
+ .quad 0x0BF56C16BF55699D7 # -0.00138889 c2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x03EFA015C50A93B49 # 2.48016e-005 c3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x0BE92524743CC46B8 # -2.75573e-007 c4
+ .quad 0x0BE92524743CC46B8
+
+.Lsinarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BFC555555545E87D
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x03F811110DF01232D
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x03EC6DBE4AD1572D5
+
+.Lsincosarray:
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x03F811110DF01232D # 0.00833333 s2
+ .quad 0x0BF56C16BF55699D7
+ .quad 0x0BF2A013A88A37196 # -0.000198413 s3
+ .quad 0x03EFA015C50A93B49
+ .quad 0x03EC6DBE4AD1572D5 # 2.75573e-006 s4
+ .quad 0x0BE92524743CC46B8
+
+.Lcossinarray:
+ .quad 0x03FA5555555502F31 # 0.0416667 c1
+ .quad 0x0BFC555555545E87D # -0.166667 s1
+ .quad 0x0BF56C16BF55699D7 # c2
+ .quad 0x03F811110DF01232D
+ .quad 0x03EFA015C50A93B49 # c3
+ .quad 0x0BF2A013A88A37196
+ .quad 0x0BE92524743CC46B8 # c4
+ .quad 0x03EC6DBE4AD1572D5
+
+.align 8
+ .Levensin_oddcos_tbl:
+
+ .quad .Lsinsin_sinsin_piby4 # 0 * ; Done
+ .quad .Lsinsin_sincos_piby4 # 1 + ; Done
+ .quad .Lsinsin_cossin_piby4 # 2 ; Done
+ .quad .Lsinsin_coscos_piby4 # 3 + ; Done
+
+ .quad .Lsincos_sinsin_piby4 # 4 ; Done
+ .quad .Lsincos_sincos_piby4 # 5 * ; Done
+ .quad .Lsincos_cossin_piby4 # 6 ; Done
+ .quad .Lsincos_coscos_piby4 # 7 ; Done
+
+ .quad .Lcossin_sinsin_piby4 # 8 ; Done
+ .quad .Lcossin_sincos_piby4 # 9 ; TBD
+ .quad .Lcossin_cossin_piby4 # 10 * ; Done
+ .quad .Lcossin_coscos_piby4 # 11 ; Done
+
+ .quad .Lcoscos_sinsin_piby4 # 12 ; Done
+ .quad .Lcoscos_sincos_piby4 # 13 + ; Done
+ .quad .Lcoscos_cossin_piby4 # 14 ; Done
+ .quad .Lcoscos_coscos_piby4 # 15 * ; Done
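+
+# Index bit i (i = 0..3) is set when input i fell in an odd region (npi2 odd).
+# The selected stub assembles the final result using, per lane, roughly:
+#
+#   result[i] = (npi2_i & 1) ? cos_poly(r_i) : sin_poly(r_i);
+#
+# with the sign applied separately from the p_sign/p_sign1 masks.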
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+ .weak vrsa_sinf_
+ .set vrsa_sinf_,__vrsa_sinf__
+ .weak vrsa_sinf__
+ .set vrsa_sinf__,__vrsa_sinf__
+
+ .text
+ .align 16
+ .p2align 4,,15
+
+#FORTRAN subroutine implementation of array sin
+#VRSA_SINF(N,X,Y)
+#C equivalent
+#void vrsa_sinf__(int *n, float *x, float *y)
+#{
+# vrsa_sinf(*n,x,y);
+#}
+
+.globl __vrsa_sinf__
+ .type __vrsa_sinf__,@function
+__vrsa_sinf__:
+ mov (%rdi),%edi
+
+ .align 16
+ .p2align 4,,15
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# define local variable storage offsets
+.equ p_temp,0 # temporary for get/put bits operation
+.equ p_temp1,0x10 # temporary for get/put bits operation
+
+.equ save_xmm6,0x20 # temporary for get/put bits operation
+.equ save_xmm7,0x30 # temporary for get/put bits operation
+.equ save_xmm8,0x40 # temporary for get/put bits operation
+.equ save_xmm9,0x50 # temporary for get/put bits operation
+.equ save_xmm0,0x60 # temporary for get/put bits operation
+.equ save_xmm11,0x70 # temporary for get/put bits operation
+.equ save_xmm12,0x80 # temporary for get/put bits operation
+.equ save_xmm13,0x90 # temporary for get/put bits operation
+.equ save_xmm14,0x0A0 # temporary for get/put bits operation
+.equ save_xmm15,0x0B0 # temporary for get/put bits operation
+
+.equ r,0x0C0 # pointer to r for remainder_piby2
+.equ rr,0x0D0 # pointer to r for remainder_piby2
+.equ region,0x0E0 # pointer to r for remainder_piby2
+
+.equ r1,0x0F0 # pointer to r for remainder_piby2
+.equ rr1,0x0100 # pointer to r for remainder_piby2
+.equ region1,0x0110 # pointer to r for remainder_piby2
+
+.equ p_temp2,0x0120 # temporary for get/put bits operation
+.equ p_temp3,0x0130 # temporary for get/put bits operation
+
+.equ p_temp4,0x0140 # temporary for get/put bits operation
+.equ p_temp5,0x0150 # temporary for get/put bits operation
+
+.equ p_original,0x0160 # original x
+.equ p_mask,0x0170 # original x
+.equ p_sign,0x0180 # original x
+
+.equ p_original1,0x0190 # original x
+.equ p_mask1,0x01A0 # original x
+.equ p_sign1,0x01B0 # original x
+
+.equ save_r12,0x01C0 # temporary for get/put bits operation
+.equ save_r13,0x01D0 # temporary for get/put bits operation
+
+.equ save_xa,0x01E0 #qword
+.equ save_ya,0x01F0 #qword
+
+.equ save_nv,0x0200 #qword
+.equ p_iter,0x0210 # qword storage for number of loop iterations
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.globl vrsa_sinf
+ .type vrsa_sinf,@function
+vrsa_sinf:
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# parameters are passed in by Linux (System V AMD64 ABI) as:
+# rdi - int n
+# rsi - float *x
+# rdx - float *y
+
+ sub $0x0228,%rsp
+ mov %r12,save_r12(%rsp) # save r12
+ mov %r13,save_r13(%rsp) # save r13
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START PROCESS INPUT
+# save the arguments
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+ mov %rdx,save_ya(%rsp) # save y_array pointer
+#ifdef INTEGER64
+ mov %rdi,%rax
+#else
+ mov %edi,%eax
+ mov %rax,%rdi
+#endif
+ mov %rdi,save_nv(%rsp) # save number of values
+# see if too few values to call the main loop
+ shr $2,%rax # get number of iterations
+ jz .L__vrsa_cleanup # jump if only single calls
+# prepare the iteration counts
+ mov %rax,p_iter(%rsp) # save number of iterations
+ shl $2,%rax
+ sub %rax,%rdi # compute number of extra single calls
+ mov %rdi,save_nv(%rsp) # save number of left over values
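+
+# Loop bookkeeping above, roughly:
+#
+#   iter  = n >> 2;            /* number of 4-wide groups            */
+#   if (iter == 0) goto cleanup;
+#   nleft = n - (iter << 2);   /* 0..3 values handled in the cleanup */
+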
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#START LOOP
+.align 16
+.L__vrsa_top:
+# build the input _m128d
+ mov save_xa(%rsp),%rsi # get x_array pointer
+ movlps (%rsi),%xmm0
+ movhps 8(%rsi),%xmm0
+
+ prefetch 32(%rsi)
+ add $16,%rsi
+ mov %rsi,save_xa(%rsp) # save x_array pointer
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# V4 START
+ movhlps %xmm0,%xmm8
+ cvtps2pd %xmm0,%xmm10 # convert input to double.
+ cvtps2pd %xmm8,%xmm1 # convert input to double.
+
+movdqa %xmm10,%xmm6
+movdqa %xmm1,%xmm7
+movapd .L__real_7fffffffffffffff(%rip),%xmm2
+
+andpd %xmm2,%xmm10 #Unsign
+andpd %xmm2,%xmm1 #Unsign
+
+movd %xmm10,%rax #rax is lower arg
+movhpd %xmm10, p_temp+8(%rsp) #
+mov p_temp+8(%rsp),%rcx #rcx = upper arg
+
+movd %xmm1,%r8 #r8 is lower arg
+movhpd %xmm1, p_temp1+8(%rsp) #
+mov p_temp1+8(%rsp),%r9 #r9 = upper arg
+
+movdqa %xmm10,%xmm12
+movdqa %xmm1,%xmm13
+
+pcmpgtd %xmm6,%xmm12
+pcmpgtd %xmm7,%xmm13
+movdqa %xmm12,%xmm6
+movdqa %xmm13,%xmm7
+psrldq $4,%xmm12
+psrldq $4,%xmm13
+psrldq $8,%xmm6
+psrldq $8,%xmm7
+
+mov $0x3FE921FB54442D18,%rdx #piby4 +
+mov $0x411E848000000000,%r10 #5e5 +
+movapd .L__real_3fe0000000000000(%rip),%xmm4 #0.5 for later use +
+
+por %xmm6,%xmm12
+por %xmm7,%xmm13
+movd %xmm12,%r12 #Move Sign to gpr **
+movd %xmm13,%r13 #Move Sign to gpr **
+
+movapd %xmm10,%xmm2 #x0
+movapd %xmm1,%xmm3 #x1
+movapd %xmm10,%xmm6 #x0
+movapd %xmm1,%xmm7 #x1
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm2 = x, xmm4 =0.5/t, xmm6 =x
+# xmm3 = x, xmm5 =0.5/t, xmm7 =x
+.align 16
+.Leither_or_both_arg_gt_than_piby4:
+ cmp %r10,%rax
+ jae .Lfirst_or_next3_arg_gt_5e5
+
+ cmp %r10,%rcx
+ jae .Lsecond_or_next2_arg_gt_5e5
+
+ cmp %r10,%r8
+ jae .Lthird_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfourth_arg_gt_5e5
+
+
+# /* Find out what multiple of piby2 */
+# npi2 = (int)(x * twobypi + 0.5);
+ movapd .L__real_3fe45f306dc9c883(%rip),%xmm10
+ mulpd %xmm10,%xmm2 # * twobypi
+ mulpd %xmm10,%xmm3 # * twobypi
+
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ addpd %xmm4,%xmm3 # +0.5, npi2
+
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+# /* Subtract the multiple from x to get an extra-precision remainder */
+
+ movd %xmm4,%r8 # Region
+ movd %xmm5,%r9 # Region
+
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+ mov %r8,%r10
+ mov %r9,%r11
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm1,%xmm7 # t-rhead
+
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = npi2 (int), xmm10 =rhead, xmm8 =rtail
+# xmm5 = npi2 (int), xmm1 =rhead, xmm9 =rtail
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+# GET_BITS_DP64(rhead-rtail, uy); ; originally only rhead
+# xmm4 = Sign, xmm10 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 =rhead, xmm9 =rtail
+ movapd %xmm10,%xmm6 # rhead
+ movapd %xmm1,%xmm7 # rhead
+
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# xmm4 = Sign, xmm10 = r, xmm6 =rhead, xmm8 =rtail
+# xmm5 = Sign, xmm1 = r, xmm7 =rhead, xmm9 =rtail
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2 # move r for r2
+ movapd %xmm1,%xmm3 # move r for r2
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfirst_or_next3_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+
+ cmp %r10,%rcx #is upper arg >= 5e5
+ jae .Lboth_arg_gt_5e5
+
+.Llower_arg_gt_5e5:
+# Upper Arg is < 5e5, Lower arg is >= 5e5
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Be sure not to use %xmm3,%xmm1 and xmm7
+# Use %xmm8,,%xmm5 xmm0, xmm12
+# %xmm11,,%xmm9 xmm13
+
+
+ movlpd %xmm10,r(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm10,%xmm10 #Needed since we want to work on upper arg
+ movhlps %xmm2,%xmm2
+ movhlps %xmm6,%xmm6
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm8 # xmm8 = piby2_1
+ cvttsd2si %xmm2,%ecx # ecx = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm0 # xmm0 = piby2_2
+ cvtsi2sd %ecx,%xmm2 # xmm2 = npi2 trunc to doubles
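+# Cody-Waite style reduction: pi/2 is split into piby2_1 + piby2_2 + piby2_2tail
+# so that npi2*piby2_1 is exact and rhead/rtail keep an extra-precision remainder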
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm12 # xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0                              # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %ecx,region+4(%rsp) # store upper region
+ movsd %xmm6,%xmm10
+ subsd %xmm0,%xmm10 # xmm10 = r=(rhead-rtail)
+ movlpd %xmm10,r+8(%rsp) # store upper r
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r(%rsp),%rdi #Restore lower fp arg for remainder_piby2 call
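+# __remainder_piby2d2f is called with the argument bit pattern in rdi and the
+# addresses of the reduced value (rsi) and the region/quadrant word (rdx)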
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf:
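+# argument is NaN/Inf: set the quiet-NaN bit (0x0008000000000000) in the raw
+# bits so the eventual result is a quiet NaN, and force the region to 0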
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lboth_arg_gt_5e5:
+#Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+ movhlps %xmm10,%xmm6 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %rax,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5
+
+ mov %rcx,p_temp(%rsp) #Save upper arg
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r(%rsp),%rsi
+
+# added ins- changed input from xmm10 to xmm0
+ movd %xmm10,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ mov p_temp(%rsp),%rcx #Restore upper arg
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_of_both_gt_5e5: #lower arg is nan/inf
+# mov .LQWORD,%rax PTR p_original[rsp]
+ mov $0x00008000000000000,%r11
+ or %r11,%rax
+ mov %rax,r(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5
+
+
+ mov %r8,p_temp2(%rsp)
+ mov %r9,p_temp4(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm6,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp2(%rsp),%r8
+ mov p_temp4(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_of_both_gt_5e5:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lsecond_or_next2_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Restore xmm4 and %xmm3,,%xmm1 xmm7
+# Can use %xmm0,,%xmm8 xmm12
+# %xmm9,,%xmm5 xmm11, xmm13
+
+ movhpd %xmm10,r+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm2 # x*twobypi
+ addsd %xmm4,%xmm2 # xmm2 = npi2=(x*twobypi+0.5)
+	movsd	.L__real_3ff921fb54400000(%rip),%xmm8	# xmm8 = piby2_1
+	cvttsd2si	%xmm2,%eax			# eax = npi2 trunc to ints
+	movsd	.L__real_3dd0b4611a600000(%rip),%xmm0	# xmm0 = piby2_2
+ cvtsi2sd %eax,%xmm2 # xmm2 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm2,%xmm8 # npi2 * piby2_1
+ subsd %xmm8,%xmm6 # xmm6 = rhead =(x-npi2*piby2_1)
+	movsd	.L__real_3ba3198a2e037073(%rip),%xmm12	# xmm12 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm6,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+	mulsd	%xmm2,%xmm0                              # xmm0 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm0,%xmm6 # xmm6 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm2,%xmm12 # npi2 * piby2_2tail
+ subsd %xmm6,%xmm5 # t-rhead
+ subsd %xmm5,%xmm0 # (rtail-(t-rhead))
+ addsd %xmm12,%xmm0 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+	mov	 %eax,region(%rsp)			# store lower region
+
+	subsd	%xmm0,%xmm6				# xmm6 = r=(rhead-rtail)
+
+	movlpd	 %xmm6,r(%rsp)				# store lower r
+
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %rcx,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf
+
+ mov %r8,p_temp(%rsp)
+ mov %r9,p_temp2(%rsp)
+ movapd %xmm1,p_temp1(%rsp)
+ movapd %xmm3,p_temp3(%rsp)
+ movapd %xmm7,p_temp5(%rsp)
+
+ lea region+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r+8(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp(%rsp),%r8
+ mov p_temp2(%rsp),%r9
+ movapd p_temp1(%rsp),%xmm1
+ movapd p_temp3(%rsp),%xmm3
+ movapd p_temp5(%rsp),%xmm7
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf:
+ mov $0x00008000000000000,%r11
+ or %r11,%rcx
+ mov %rcx,r+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .Lcheck_next2_args
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcheck_next2_args:
+
+ mov $0x411E848000000000,%r10 #5e5 +
+
+ cmp %r10,%r8
+ jae .Lfirst_second_done_third_or_fourth_arg_gt_5e5
+
+ cmp %r10,%r9
+ jae .Lfirst_second_done_fourth_arg_gt_5e5
+
+
+
+# Work on next two args, both < 5e5
+# %xmm3,,%xmm1 xmm5 = x, xmm4 = 0.5
+
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 #Restore 0.5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm3 # * twobypi
+ addpd %xmm4,%xmm3 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm1 # piby2_1
+ cvttpd2dq %xmm3,%xmm5 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm9 # piby2_2
+ cvtdq2pd %xmm5,%xmm3 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm5,region1(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm3,%xmm1 # npi2 * piby2_1;
+
+# rtail = npi2 * piby2_2;
+ mulpd %xmm3,%xmm9 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm1,%xmm7 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm7,%xmm1 # t
+
+# rhead = t - rtail;
+ subpd %xmm9,%xmm1 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm3 # npi2 * piby2_2tail
+
+ subpd %xmm1,%xmm7 # t-rhead
+ subpd %xmm7,%xmm9 # - ((t - rhead) - rtail)
+ addpd %xmm3,%xmm9 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm1,%xmm7 ; rhead
+ subpd %xmm9,%xmm1 # r = rhead - rtail
+ movapd %xmm1,r1(%rsp)
+
+# subpd %xmm1,%xmm7 ; rr=rhead-r
+# subpd xmm7, xmm9 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr1[rsp], xmm7
+
+ jmp .L__vrs4_sinf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lthird_or_fourth_arg_gt_5e5:
+#first two args are < 5e5, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+# Do not use %xmm3,,%xmm1 xmm7
+# Can use %xmm11,,%xmm9 xmm13
+# %xmm8,,%xmm5 xmm0, xmm12
+# Restore xmm4
+
+# Work on first two args, both < 5e5
+
+
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg >= 5e5, fourth arg >= 5e5 or < 5e5
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_third_or_fourth_arg_gt_5e5:
+# %rcx,,%rax r8, r9
+# %xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+
+ mov $0x411E848000000000,%r10 #5e5 +
+ cmp %r10,%r9
+ jae .Lboth_arg_gt_5e5_higher
+
+
+# Upper Arg is <5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movlpd %xmm1,r1(%rsp) #Save lower fp arg for remainder_piby2 call
+ movhlps %xmm1,%xmm1 #Needed since we want to work on upper arg
+ movhlps %xmm3,%xmm3
+ movhlps %xmm7,%xmm7
+
+
+# Work on Upper arg
+# Lower arg might contain nan/inf, to avoid exception use only scalar instructions on upper arg which has been moved to lower portions of fp regs
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r9d # r9d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r9d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r9d,region1+4(%rsp) # store upper region
+
+	subsd	%xmm10,%xmm7				# xmm7 = r=(rhead-rtail)
+
+ movlpd %xmm7,r1+8(%rsp) # store upper r
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+
+# Work on Lower arg
+ mov $0x07ff0000000000000,%r11 # is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_higher
+
+ lea region1(%rsp),%rdx # lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+
+
+
+
+
+
+.align 16
+.Lboth_arg_gt_5e5_higher:
+# Upper Arg is >= 5e5, Lower arg is >= 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+
+ movhlps %xmm1,%xmm7 #Save upper fp arg for remainder_piby2 call
+
+ mov $0x07ff0000000000000,%r11 #is lower arg nan/inf
+ mov %r11,%r10
+ and %r8,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher
+
+ mov %r9,p_temp1(%rsp) #Save upper arg
+ lea region1(%rsp),%rdx #lower arg is **NOT** nan/inf
+ lea r1(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm1,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ mov p_temp1(%rsp),%r9 #Restore upper arg
+
+ jmp 0f
+
+.L__vrs4_sinf_lower_naninf_of_both_gt_5e5_higher: #lower arg is nan/inf
+ mov $0x00008000000000000,%r11
+ or %r11,%r8
+ mov %r8,r1(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1(%rsp) #region = 0
+
+.align 16
+0:
+ mov $0x07ff0000000000000,%r11 #is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher
+
+ lea region1+4(%rsp),%rdx #upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ movd %xmm7,%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_of_both_gt_5e5_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) #r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) #region = 0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lfourth_arg_gt_5e5:
+#first two args are < 5e5, third arg < 5e5, fourth arg >= 5e5
+#%rcx,,%rax r8, r9
+#%xmm2,,%xmm10 xmm6 = x, xmm4 = 0.5
+
+# Work on first two args, both < 5e5
+
+ mulpd .L__real_3fe45f306dc9c883(%rip),%xmm2 # * twobypi
+ addpd %xmm4,%xmm2 # +0.5, npi2
+ movapd .L__real_3ff921fb54400000(%rip),%xmm10 # piby2_1
+ cvttpd2dq %xmm2,%xmm4 # convert packed double to packed integers
+ movapd .L__real_3dd0b4611a600000(%rip),%xmm8 # piby2_2
+ cvtdq2pd %xmm4,%xmm2 # and back to double.
+
+###
+# /* Subtract the multiple from x to get an extra-precision remainder */
+ movlpd %xmm4,region(%rsp) # Region
+###
+
+# rhead = x - npi2 * piby2_1;
+ mulpd %xmm2,%xmm10 # npi2 * piby2_1;
+# rtail = npi2 * piby2_2;
+ mulpd %xmm2,%xmm8 # rtail
+
+# rhead = x - npi2 * piby2_1;
+ subpd %xmm10,%xmm6 # rhead = x - npi2 * piby2_1;
+
+# t = rhead;
+ movapd %xmm6,%xmm10 # t
+
+# rhead = t - rtail;
+ subpd %xmm8,%xmm10 # rhead
+
+# rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulpd .L__real_3ba3198a2e037073(%rip),%xmm2 # npi2 * piby2_2tail
+
+ subpd %xmm10,%xmm6 # t-rhead
+ subpd %xmm6,%xmm8 # - ((t - rhead) - rtail)
+ addpd %xmm2,%xmm8 # rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+
+# movapd %xmm10,%xmm6 ; rhead
+ subpd %xmm8,%xmm10 # r = rhead - rtail
+ movapd %xmm10,r(%rsp)
+
+# subpd %xmm10,%xmm6 ; rr=rhead-r
+# subpd xmm6, xmm8 ; rr=(rhead-r) -rtail
+# movapd OWORD PTR rr[rsp], xmm6
+
+
+# Work on next two args, third arg < 5e5, fourth arg >= 5e5
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lfirst_second_done_fourth_arg_gt_5e5:
+
+# Upper Arg is >= 5e5, Lower arg is < 5e5
+# %r9,%r8
+# %xmm3,,%xmm1 xmm7 = x, xmm4 = 0.5
+
+ movhpd %xmm1,r1+8(%rsp) #Save upper fp arg for remainder_piby2 call
+
+# Work on Lower arg
+# Upper arg might contain nan/inf, to avoid exception use only scalar instructions on lower arg
+ movapd .L__real_3fe0000000000000(%rip),%xmm4 # Restore 0.5
+ mulsd .L__real_3fe45f306dc9c883(%rip),%xmm3 # x*twobypi
+ addsd %xmm4,%xmm3 # xmm3 = npi2=(x*twobypi+0.5)
+ movsd .L__real_3ff921fb54400000(%rip),%xmm2 # xmm2 = piby2_1
+ cvttsd2si %xmm3,%r8d # r8d = npi2 trunc to ints
+ movsd .L__real_3dd0b4611a600000(%rip),%xmm10 # xmm10 = piby2_2
+ cvtsi2sd %r8d,%xmm3 # xmm3 = npi2 trunc to doubles
+
+#/* Subtract the multiple from x to get an extra-precision remainder */
+#rhead = x - npi2 * piby2_1;
+ mulsd %xmm3,%xmm2 # npi2 * piby2_1
+ subsd %xmm2,%xmm7 # xmm7 = rhead =(x-npi2*piby2_1)
+ movsd .L__real_3ba3198a2e037073(%rip),%xmm6 # xmm6 =piby2_2tail
+
+#t = rhead;
+ movsd %xmm7,%xmm5 # xmm5 = t = rhead
+
+#rtail = npi2 * piby2_2;
+ mulsd %xmm3,%xmm10 # xmm10 =rtail=(npi2*piby2_2)
+
+#rhead = t - rtail
+ subsd %xmm10,%xmm7 # xmm7 =rhead=(t-rtail)
+
+#rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ mulsd %xmm3,%xmm6 # npi2 * piby2_2tail
+ subsd %xmm7,%xmm5 # t-rhead
+ subsd %xmm5,%xmm10 # (rtail-(t-rhead))
+ addsd %xmm6,%xmm10 # rtail=npi2*piby2_2tail+(rtail-(t-rhead));
+
+#r = rhead - rtail
+#rr = (rhead-r) -rtail
+ mov %r8d,region1(%rsp) # store lower region
+
+# movsd %xmm7,%xmm1
+# subsd xmm1, xmm10 ; xmm10 = r=(rhead-rtail)
+# subsd %xmm1,%xmm7 ; rr=rhead-r
+# subsd xmm7, xmm10 ; xmm6 = rr=((rhead-r) -rtail)
+
+	subsd	%xmm10,%xmm7				# xmm7 = r=(rhead-rtail)
+
+# movlpd QWORD PTR r1[rsp], xmm1 ; store upper r
+# movlpd QWORD PTR rr1[rsp], xmm7 ; store upper rr
+
+	movlpd	%xmm7,r1(%rsp)				# store lower r
+
+#Work on Upper arg
+#Note that volatiles will be trashed by the call
+#We do not care since this is the last check
+#We will construct r, rr, region and sign
+ mov $0x07ff0000000000000,%r11 # is upper arg nan/inf
+ mov %r11,%r10
+ and %r9,%r10
+ cmp %r11,%r10
+ jz .L__vrs4_sinf_upper_naninf_higher
+
+ lea region1+4(%rsp),%rdx # upper arg is **NOT** nan/inf
+ lea r1+8(%rsp),%rsi
+
+# changed input from xmm10 to xmm0
+ mov r1+8(%rsp),%rdi
+
+ call __remainder_piby2d2f@PLT
+
+ jmp 0f
+
+.L__vrs4_sinf_upper_naninf_higher:
+ mov $0x00008000000000000,%r11
+ or %r11,%r9
+ mov %r9,r1+8(%rsp) # r = x | 0x0008000000000000
+ mov %r10d,region1+4(%rsp) # region =0
+
+.align 16
+0:
+ jmp .L__vrs4_sinf_reconstruct
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrs4_sinf_reconstruct:
+#Results
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd r(%rsp),%xmm10
+ movapd r1(%rsp),%xmm1
+
+ mov region(%rsp),%r8
+ mov region1(%rsp),%r9
+ mov .L__reald_one_zero(%rip),%rdx #compare value for cossin path
+
+ mov %r8,%r10
+ mov %r9,%r11
+
+ and .L__reald_one_one(%rip),%r8 #odd/even region for cos/sin
+ and .L__reald_one_one(%rip),%r9 #odd/even region for cos/sin
+
+ shr $1,%r10 #~AB+A~B, A is sign and B is upper bit of region
+ shr $1,%r11 #~AB+A~B, A is sign and B is upper bit of region
+
+ mov %r10,%rax
+ mov %r11,%rcx
+
+ not %r12 #ADDED TO CHANGE THE LOGIC
+ not %r13 #ADDED TO CHANGE THE LOGIC
+ and %r12,%r10
+ and %r13,%r11
+
+ not %rax
+ not %rcx
+ not %r12
+ not %r13
+ and %r12,%rax
+ and %r13,%rcx
+
+ or %rax,%r10
+ or %rcx,%r11
+ and .L__reald_one_one(%rip),%r10 #(~AB+A~B)&1
+ and .L__reald_one_one(%rip),%r11 #(~AB+A~B)&1
+
+ mov %r10,%r12
+ mov %r11,%r13
+
+ and %rdx,%r12 #mask out the lower sign bit leaving the upper sign bit
+ and %rdx,%r13 #mask out the lower sign bit leaving the upper sign bit
+
+ shl $63,%r10 #shift lower sign bit left by 63 bits
+ shl $63,%r11 #shift lower sign bit left by 63 bits
+ shl $31,%r12 #shift upper sign bit left by 31 bits
+ shl $31,%r13 #shift upper sign bit left by 31 bits
+
+ mov %r10,p_sign(%rsp) #write out lower sign bit
+ mov %r12,p_sign+8(%rsp) #write out upper sign bit
+ mov %r11,p_sign1(%rsp) #write out lower sign bit
+ mov %r13,p_sign1+8(%rsp) #write out upper sign bit
+
+ mov %r8,%rax
+ mov %r9,%rcx
+
+ movapd %xmm10,%xmm2
+ movapd %xmm1,%xmm3
+
+ mulpd %xmm10,%xmm2 # r2
+ mulpd %xmm1,%xmm3 # r2
+
+ and .L__reald_zero_one(%rip),%rax
+ and .L__reald_zero_one(%rip),%rcx
+ shr $31,%r8
+ shr $31,%r9
+ or %r8,%rax
+ or %r9,%rcx
+ shl $2,%rcx
+ or %rcx,%rax
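+# as above, rax is a 4-bit even/odd-region index selecting one of the 16
+# sin/cos combination routines reached through .Levensin_oddcos_tbl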
+
+
+ lea .Levensin_oddcos_tbl(%rip),%rcx
+ jmp *(%rcx,%rax,8) #Jmp table for cos/sin calculation based on even/odd region
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.L__vrsa_sinf_cleanup:
+
+ movapd p_sign(%rsp),%xmm10
+ movapd p_sign1(%rsp),%xmm1
+
+ xorpd %xmm4,%xmm10 # (+) Sign
+ xorpd %xmm5,%xmm1 # (+) Sign
+
+ cvtpd2ps %xmm10,%xmm0
+ cvtpd2ps %xmm1,%xmm11
+ movlhps %xmm11,%xmm0
+
+# NEW
+
+.L__vrsa_bottom1:
+# store the result _m128d
+ mov save_ya(%rsp),%rdi # get y_array pointer
+ movlps %xmm0,(%rdi)
+ movhps %xmm0,8(%rdi)
+
+ prefetch 32(%rdi)
+ add $16,%rdi
+ mov %rdi,save_ya(%rsp) # save y_array pointer
+
+ mov p_iter(%rsp),%rax # get number of iterations
+ sub $1,%rax
+ mov %rax,p_iter(%rsp) # save number of iterations
+ jnz .L__vrsa_top
+
+# see if we need to do any extras
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax
+ jnz .L__vrsa_cleanup
+
+.L__final_check:
+
+# NEW
+
+ mov save_r12(%rsp),%r12 # restore r12
+ mov save_r13(%rsp),%r13 # restore r13
+
+ add $0x0228,%rsp
+ ret
+
+#NEW
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# we jump here when we have an odd number of cos calls to make at the end
+# we assume that rdx is pointing at the next x array element, r8 at the next y array element.
+# The number of values left is in save_nv
+
+.align 16
+.L__vrsa_cleanup:
+ mov save_nv(%rsp),%rax # get number of values
+ test %rax,%rax # are there any values
+ jz .L__final_check # exit if not
+
+ mov save_xa(%rsp),%rsi
+ mov save_ya(%rsp),%rdi
+
+
+# START WORKING FROM HERE
+# fill in a m128d with zeroes and the extra values and then make a recursive call.
+ xorps %xmm0,%xmm0
+ movss %xmm0,p_temp+4(%rsp)
+ movlps %xmm0,p_temp+8(%rsp)
+
+
+ mov (%rsi),%ecx # we know there's at least one
+ mov %ecx,p_temp(%rsp)
+ cmp $2,%rax
+ jl .L__vrsacg
+
+ mov 4(%rsi),%ecx # do the second value
+ mov %ecx,p_temp+4(%rsp)
+ cmp $3,%rax
+ jl .L__vrsacg
+
+ mov 8(%rsi),%ecx # do the third value
+ mov %ecx,p_temp+8(%rsp)
+
+.L__vrsacg:
+ mov $4,%rdi # parameter for N
+ lea p_temp(%rsp),%rsi # &x parameter
+ lea p_temp2(%rsp),%rdx # &y parameter
+ call vrsa_sinf@PLT # call recursively to compute four values
+
+# now copy the results to the destination array
+ mov save_ya(%rsp),%rdi
+ mov save_nv(%rsp),%rax # get number of values
+
+ mov p_temp2(%rsp),%ecx
+ mov %ecx,(%rdi) # we know there's at least one
+ cmp $2,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+4(%rsp),%ecx
+ mov %ecx,4(%rdi) # do the second value
+ cmp $3,%rax
+ jl .L__vrsacgf
+
+ mov p_temp2+8(%rsp),%ecx
+ mov %ecx,8(%rdi) # do the third value
+
+.L__vrsacgf:
+ jmp .L__final_check
+
+#NEW
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;JUMP TABLE;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_coscos_piby4:
+ movapd %xmm2,%xmm0 # r
+ movapd %xmm3,%xmm11 # r
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c3*x2
+ mulpd %xmm3,%xmm9 # c3*x2
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0 ;trash r
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0 ;trash r
+
+ mulpd %xmm2,%xmm2 # x4
+ mulpd %xmm3,%xmm3 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm2,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm3,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x4 * zc
+
+ subpd %xmm0,%xmm4 # + t
+ subpd %xmm11,%xmm5 # + t
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lcossin_cossin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lsincosarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lsincosarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term
+ movsd %xmm3,%xmm7 # move low x2 for x3 for sin term
+ mulsd %xmm10,%xmm6 # get low x3 for sin term
+ mulsd %xmm1,%xmm7 # get low x3 for sin term
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for sin and cos terms
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for sin and cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+ movhlps %xmm2,%xmm12 # move high r for cos
+ movhlps %xmm3,%xmm13 # move high r for cos
+
+ movhlps %xmm4,%xmm8 # xmm4 = sin , xmm8 = cos
+	movhlps	%xmm5,%xmm9				# xmm5 = sin , xmm9 = cos
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm7,%xmm5 # sin *x3
+
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm13 #-t=r-1.0
+
+ addsd %xmm10,%xmm4 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm13,%xmm9 # cos+t
+
+ movlhps %xmm8,%xmm4
+ movlhps %xmm9,%xmm5
+
+ jmp .L__vrsa_sinf_cleanup
+.align 16
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.Lsincos_cossin_piby4:
+
+ movapd .Lsincosarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lsincosarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm3,%xmm7 # sincos term upper x2 for x3
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lsincosarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lcossinarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm0,%xmm0 # move high x4 for cos term
+
+ movsd %xmm2,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm1,%xmm7
+
+ mulsd %xmm10,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm2,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = cos , xmm4 = sin (cossin)
+ movhlps %xmm5,%xmm9 # xmm9 = sin , xmm5 = cos (sincos)
+
+ mulsd %xmm6,%xmm4 # sin *x3
+ mulsd %xmm11,%xmm5 # cos *x4
+ mulsd %xmm0,%xmm8 # cos *x4
+ mulsd %xmm7,%xmm9 # sin *x3
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 # -t=r-1.0
+
+ movhlps %xmm1,%xmm11 # move high x for x for sin term (sincos)
+
+ addsd %xmm10,%xmm4 # sin + x +
+ addsd %xmm11,%xmm9 # sin + x +
+
+ subsd %xmm12,%xmm8 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsincos_sincos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lcossinarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lcossinarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm6 # move x2 for x4
+ movapd %xmm3,%xmm7 # move x2 for x4
+
+ mulpd %xmm2,%xmm4 # x2s6
+ mulpd %xmm3,%xmm5 # x2s6
+ mulpd %xmm2,%xmm8 # x2s3
+ mulpd %xmm3,%xmm9 # x2s3
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s4+x2s3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # s4+x2s3
+ addpd .Lcossinarray(%rip),%xmm8 # s2+x2s1
+ addpd .Lcossinarray(%rip),%xmm9 # s2+x2s1
+
+ mulpd %xmm0,%xmm4 # x4(s4+x2s3)
+ mulpd %xmm11,%xmm5 # x4(s4+x2s3)
+
+ mulpd %xmm10,%xmm6 # get low x3 for sin term
+ mulpd %xmm1,%xmm7 # get low x3 for sin term
+ movhlps %xmm6,%xmm6 # move low x2 for x3 for sin term
+ movhlps %xmm7,%xmm7 # move low x2 for x3 for sin term
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos terms
+ mulsd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos terms
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+	movhlps	%xmm4,%xmm12				# xmm12 = sin , xmm4 = cos
+	movhlps	%xmm5,%xmm13				# xmm13 = sin , xmm5 = cos
+
+ mulsd %xmm6,%xmm12 # sin *x3
+ mulsd %xmm7,%xmm13 # sin *x3
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm11,%xmm5 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 #-t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm3 #-t=r-1.0
+
+ movhlps %xmm10,%xmm0 # move high x for x for sin term
+ movhlps %xmm1,%xmm11 # move high x for x for sin term
+ # Reverse 10 and 0
+
+ addsd %xmm0,%xmm12 # sin + x
+ addsd %xmm11,%xmm13 # sin + x
+
+ subsd %xmm2,%xmm4 # cos+t
+ subsd %xmm3,%xmm5 # cos+t
+
+ movlhps %xmm12,%xmm4
+ movlhps %xmm13,%xmm5
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lcossin_sincos_piby4:
+
+ movapd .Lcossinarray+0x30(%rip),%xmm4 # s4
+ movapd .Lsincosarray+0x30(%rip),%xmm5 # s4
+ movdqa .Lcossinarray+0x10(%rip),%xmm8 # s2
+ movdqa .Lsincosarray+0x10(%rip),%xmm9 # s2
+
+ movapd %xmm2,%xmm0 # move x2 for x4
+ movapd %xmm3,%xmm11 # move x2 for x4
+ movapd %xmm2,%xmm7 # upper x2 for x3 for sin term (sincos)
+
+ mulpd %xmm2,%xmm4 # x2s4
+ mulpd %xmm3,%xmm5 # x2s4
+ mulpd %xmm2,%xmm8 # x2s2
+ mulpd %xmm3,%xmm9 # x2s2
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # s3+x2s4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # s3+x2s4
+ addpd .Lcossinarray(%rip),%xmm8 # s1+x2s2
+ addpd .Lsincosarray(%rip),%xmm9 # s1+x2s2
+
+ mulpd %xmm0,%xmm4 # x4(s3+x2s4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ movhlps %xmm11,%xmm11 # move high x4 for cos term
+
+ movsd %xmm3,%xmm6 # move low x2 for x3 for sin term (cossin)
+ mulpd %xmm10,%xmm7
+
+ mulsd %xmm1,%xmm6 # get low x3 for sin term (cossin)
+ movhlps %xmm7,%xmm7 # get high x3 for sin term (sincos)
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm2 # 0.5*x2 for cos term
+ mulpd .L__real_3fe0000000000000(%rip),%xmm3 # 0.5*x2 for cos term
+
+ addpd %xmm8,%xmm4 # z
+ addpd %xmm9,%xmm5 # z
+
+
+ movhlps %xmm3,%xmm12 # move high r for cos (cossin)
+
+
+ movhlps %xmm4,%xmm8 # xmm8 = sin , xmm4 = cos (sincos)
+ movhlps %xmm5,%xmm9 # xmm9 = cos , xmm5 = sin (cossin)
+
+ mulsd %xmm0,%xmm4 # cos *x4
+ mulsd %xmm6,%xmm5 # sin *x3
+ mulsd %xmm7,%xmm8 # sin *x3
+ mulsd %xmm11,%xmm9 # cos *x4
+
+ subsd .L__real_3ff0000000000000(%rip),%xmm2 # -t=r-1.0
+ subsd .L__real_3ff0000000000000(%rip),%xmm12 # -t=r-1.0
+
+ movhlps %xmm10,%xmm11 # move high x for x for sin term (sincos)
+
+ subsd %xmm2,%xmm4 # cos-(-t)
+ subsd %xmm12,%xmm9 # cos-(-t)
+
+ addsd %xmm11,%xmm8 # sin + x
+ addsd %xmm1,%xmm5 # sin + x
+
+ movlhps %xmm8,%xmm4 # cossin
+ movlhps %xmm9,%xmm5 # sincos
+
+ jmp .L__vrsa_sinf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: SIN
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: COS
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; SIN
+ movapd %xmm3,%xmm11 # x2 ; COS
+ movapd %xmm3,%xmm1 # copy of x2 for x4
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm0 # x4
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm3,%xmm1 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm1,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm1,%xmm5 # x4 * zc
+
+ addpd %xmm10,%xmm4 # +x
+ subpd %xmm11,%xmm5 # +t
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsinsin_coscos_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr: COS
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr: SIN
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ movapd %xmm2,%xmm0 # x2 ; COS
+ movapd %xmm3,%xmm11 # x2 ; SIN
+ movapd %xmm2,%xmm10 # copy of x2 for x4
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # s4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # s2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # s4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # s2*x2
+
+ mulpd %xmm2,%xmm10 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # s3+x2c4
+ addpd .Lcosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # s1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm10,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(s3+x2s4)
+
+ subpd .L__real_3ff0000000000000(%rip),%xmm0 # -t=r-1.0
+ addpd %xmm8,%xmm4 # zc
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm10,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # x3 * zc
+
+ subpd %xmm0,%xmm4 # +t
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrsa_sinf_cleanup
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+.align 16
+.Lcoscos_cossin_piby4: #Derive from cossin_coscos
+ movhlps %xmm2,%xmm0 # x2 for 0.5x2 for upper cos
+ movsd %xmm2,%xmm6 # lower x2 for x3 for lower sin
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lsincosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+
+ movapd %xmm12,%xmm2 # upper=x4
+ movsd %xmm6,%xmm2 # lower=x2
+ mulsd %xmm10,%xmm2 # lower=x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # upper= x4 * zc
+ # lower=x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+
+ movlhps %xmm7,%xmm10 #
+ addpd %xmm10,%xmm4 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrsa_sinf_cleanup
+.align 16
+.Lcoscos_sincos_piby4: #Derive from cossin_coscos
+ movsd %xmm2,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm3,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcossinarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcosarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm10,%xmm2 # upper=x3 for sin
+ mulsd %xmm10,%xmm2 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm2,%xmm4 # lower= x4 * zc
+ # upper= x3 * zs
+ mulpd %xmm13,%xmm5 # x4 * zc
+
+
+ movsd %xmm7,%xmm10
+ addpd %xmm10,%xmm4 # +x for upper sin, +t for lower cos
+ subpd %xmm11,%xmm5 # -(-t)
+
+ jmp .L__vrsa_sinf_cleanup
+.align 16
+.Lcossin_coscos_piby4:
+ movhlps %xmm3,%xmm0 # x2 for 0.5x2 for upper cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd %xmm3,%xmm6 # lower x2 for x3 for sin
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lsincosarray(%rip),%xmm9 # c2+x2c1
+
+ movapd %xmm13,%xmm3 # upper=x4
+ movsd %xmm6,%xmm3 # lower x2
+ mulsd %xmm1,%xmm3 # lower x2*x
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # upper= x4 * zc
+ # lower=x3 * zs
+
+ movlhps %xmm7,%xmm1
+ addpd %xmm1,%xmm5 # +x for lower sin, +t for upper cos
+ subpd %xmm11,%xmm4 # -(-t)
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lcossin_sinsin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm3,%xmm0 # x2
+ movapd %xmm3,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsincosarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsincosarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsincosarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsincosarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ movapd %xmm13,%xmm3 # upper x4 for cos
+ movsd %xmm7,%xmm3 # lower x2 for sin
+ mulsd %xmm1,%xmm3 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm1 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +t upper, +x lower
+
+
+ jmp .L__vrsa_sinf_cleanup
+.align 16
+.Lsincos_coscos_piby4:
+ movsd %xmm3,%xmm0 # x2 for 0.5x2 for lower cos
+ movapd %xmm2,%xmm11 # x2 for 0.5x2
+ movapd %xmm2,%xmm12 # x2 for x4
+ movapd %xmm3,%xmm13 # x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm7
+
+ movdqa .Lcosarray+0x30(%rip),%xmm4 # cs4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcosarray+0x10(%rip),%xmm8 # cs2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # 0.5 *x2
+ mulpd .L__real_3fe0000000000000(%rip),%xmm11 # 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ subsd %xmm0,%xmm7 # t=1.0-r for cos
+ subpd .L__real_3ff0000000000000(%rip),%xmm11 # -t=r-1.0
+ mulpd %xmm2,%xmm12 # x4
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcosarray+0x20(%rip),%xmm4 # c4+x2c3
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c4+x2c3
+ addpd .Lcosarray(%rip),%xmm8 # c2+x2c1
+ addpd .Lcossinarray(%rip),%xmm9 # c2+x2c1
+
+ mulpd %xmm1,%xmm3 # upper=x3 for sin
+ mulsd %xmm1,%xmm3 # lower=x4 for cos
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zczs
+ addpd %xmm9,%xmm5 # zc
+
+ mulpd %xmm12,%xmm4 # x4 * zc
+ mulpd %xmm3,%xmm5 # lower= x4 * zc
+ # upper= x3 * zs
+
+ movsd %xmm7,%xmm1
+ subpd %xmm11,%xmm4 # -(-t)
+ addpd %xmm1,%xmm5 # +x for upper sin, +t for lower cos
+
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsincos_sinsin_piby4: # Derived from sincos_coscos
+
+ movsd %xmm3,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lcossinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lcossinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lcossinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lcossinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # upper x3 for sin
+ mulsd %xmm1,%xmm3 # lower x4 for cos
+
+ movhlps %xmm1,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11 # upper =t ; lower =x
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm11,%xmm5 # +t lower, +x upper
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsinsin_cossin_piby4: # Derived from sincos_coscos
+
+ movhlps %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm7
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lsincosarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lsincosarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lsincosarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lsincosarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ movapd %xmm12,%xmm2 # upper x4 for cos
+ movsd %xmm7,%xmm2 # lower x2 for sin
+ mulsd %xmm10,%xmm2 # lower x3=x2*x for sin
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm11,%xmm10 # t for upper cos and x for lower sin
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # upper=x4 * zc
+ # lower=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm10,%xmm4 # +t upper, +x lower
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsinsin_sincos_piby4: # Derived from sincos_coscos
+
+ movsd %xmm2,%xmm0 # x2
+ movapd %xmm2,%xmm12 # copy of x2 for x4
+ movapd %xmm3,%xmm13 # copy of x2 for x4
+ movsd .L__real_3ff0000000000000(%rip),%xmm11
+
+ movdqa .Lcossinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+ movapd .Lcossinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulsd .L__real_3fe0000000000000(%rip),%xmm0 # r = 0.5 *x2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ mulpd %xmm2,%xmm12 # x4
+ subsd %xmm0,%xmm11 # t=1.0-r for cos
+ mulpd %xmm3,%xmm13 # x4
+
+ addpd .Lcossinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+ addpd .Lcossinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm1,%xmm3 # x3
+ mulpd %xmm10,%xmm2 # upper x3 for sin
+ mulsd %xmm10,%xmm2 # lower x4 for cos
+
+ movhlps %xmm10,%xmm6
+
+ mulpd %xmm12,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm13,%xmm5 # x4(c3+x2c4)
+
+ movlhps %xmm6,%xmm11
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zszc
+
+ mulpd %xmm3,%xmm5 # x3 * zs
+ mulpd %xmm2,%xmm4 # lower=x4 * zc
+ # upper=x3 * zs
+
+ addpd %xmm1,%xmm5 # +x
+ addpd %xmm11,%xmm4 # +t lower, +x upper
+
+ jmp .L__vrsa_sinf_cleanup
+
+.align 16
+.Lsinsin_sinsin_piby4:
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+# p_sign0 = Sign, xmm10 = r, xmm2 = %xmm6,%r2 =rr
+# p_sign1 = Sign, xmm1 = r, xmm3 = %xmm7,%r2 =rr
+#;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
+
+ #x2 = x * x;
+ #(x + x * x2 * (c1 + x2 * (c2 + x2 * (c3 + x2 * c4))));
+
+ #x + x3 * ((c1 + x2 *c2) + x4 * (c3 + x2 * c4));
+
+
+ movapd %xmm2,%xmm0 # x2
+ movapd %xmm3,%xmm11 # x2
+
+ movdqa .Lsinarray+0x30(%rip),%xmm4 # c4
+ movdqa .Lsinarray+0x30(%rip),%xmm5 # c4
+
+ mulpd %xmm2,%xmm0 # x4
+ mulpd %xmm3,%xmm11 # x4
+
+ movapd .Lsinarray+0x10(%rip),%xmm8 # c2
+ movapd .Lsinarray+0x10(%rip),%xmm9 # c2
+
+ mulpd %xmm2,%xmm4 # c4*x2
+ mulpd %xmm3,%xmm5 # c4*x2
+
+ mulpd %xmm2,%xmm8 # c2*x2
+ mulpd %xmm3,%xmm9 # c2*x2
+
+ addpd .Lsinarray+0x20(%rip),%xmm4 # c3+x2c4
+ addpd .Lsinarray+0x20(%rip),%xmm5 # c3+x2c4
+
+ mulpd %xmm10,%xmm2 # x3
+ mulpd %xmm1,%xmm3 # x3
+
+ addpd .Lsinarray(%rip),%xmm8 # c1+x2c2
+ addpd .Lsinarray(%rip),%xmm9 # c1+x2c2
+
+ mulpd %xmm0,%xmm4 # x4(c3+x2c4)
+ mulpd %xmm11,%xmm5 # x4(c3+x2c4)
+
+ addpd %xmm8,%xmm4 # zs
+ addpd %xmm9,%xmm5 # zs
+
+ mulpd %xmm2,%xmm4 # x3 * zs
+ mulpd %xmm3,%xmm5 # x3 * zs
+
+ addpd %xmm10,%xmm4 # +x
+ addpd %xmm1,%xmm5 # +x
+
+ jmp .L__vrsa_sinf_cleanup
diff --git a/src/hypot.c b/src/hypot.c
new file mode 100644
index 0000000..063d526
--- /dev/null
+++ b/src/hypot.c
@@ -0,0 +1,223 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SCALEDOUBLE_1
+#define USE_INFINITY_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SCALEDOUBLE_1
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline double retval_errno_erange_overflow(double x, double y)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = y;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"hypot";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = infinity_with_flags(AMD_F_OVERFLOW | AMD_F_INEXACT);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+double FN_PROTOTYPE(hypot)(double x, double y)
+#else
+double FN_PROTOTYPE(hypot)(double x, double y)
+#endif
+{
+ /* Returns sqrt(x*x + y*y) with no overflow or underflow unless
+ the result warrants it */
+
+ const double large = 1.79769313486231570815e+308; /* 0x7fefffffffffffff */
+
+ double u, r, retval, hx, tx, x2, hy, ty, y2, hs, ts;
+ unsigned long long xexp, yexp, ux, uy, ut;
+ int dexp, expadjust;
+
+ GET_BITS_DP64(x, ux);
+ ux &= ~SIGNBIT_DP64;
+ GET_BITS_DP64(y, uy);
+ uy &= ~SIGNBIT_DP64;
+ xexp = (ux >> EXPSHIFTBITS_DP64);
+ yexp = (uy >> EXPSHIFTBITS_DP64);
+
+ if (xexp == BIASEDEMAX_DP64 + 1 || yexp == BIASEDEMAX_DP64 + 1)
+ {
+ /* One or both of the arguments are NaN or infinity. The
+ result will also be NaN or infinity. */
+ retval = x*x + y*y;
+ if (((xexp == BIASEDEMAX_DP64 + 1) && !(ux & MANTBITS_DP64)) ||
+ ((yexp == BIASEDEMAX_DP64 + 1) && !(uy & MANTBITS_DP64)))
+ /* x or y is infinity. ISO C99 defines that we must
+ return +infinity, even if the other argument is NaN.
+ Note that the computation of x*x + y*y above will already
+ have raised invalid if either x or y is a signalling NaN. */
+ return infinity_with_flags(0);
+ else
+ /* One or both of x or y is NaN, and neither is infinity.
+ Raise invalid if it's a signalling NaN */
+ return retval;
+ }
+
+ /* Set x = abs(x) and y = abs(y) */
+ PUT_BITS_DP64(ux, x);
+ PUT_BITS_DP64(uy, y);
+
+ /* The difference in exponents between x and y */
+ dexp = (int)(xexp - yexp);
+ expadjust = 0;
+
+ if (ux == 0)
+ /* x is zero */
+ return y;
+ else if (uy == 0)
+ /* y is zero */
+ return x;
+ else if (dexp > MANTLENGTH_DP64 + 1 || dexp < -MANTLENGTH_DP64 - 1)
+ /* One of x and y is insignificant compared to the other */
+ return x + y; /* Raise inexact */
+ else if (xexp > EXPBIAS_DP64 + 500 || yexp > EXPBIAS_DP64 + 500)
+ {
+ /* Danger of overflow; scale down by 2**600. */
+ expadjust = 600;
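+      /* 0x2580000000000000 is 600 << 52, so subtracting it from the bit
+         pattern multiplies the value by 2**(-600) */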
+ ux -= 0x2580000000000000;
+ PUT_BITS_DP64(ux, x);
+ uy -= 0x2580000000000000;
+ PUT_BITS_DP64(uy, y);
+ }
+ else if (xexp < EXPBIAS_DP64 - 500 || yexp < EXPBIAS_DP64 - 500)
+ {
+ /* Danger of underflow; scale up by 2**600. */
+ expadjust = -600;
+ if (xexp == 0)
+ {
+ /* x is denormal - handle by adding 601 to the exponent
+ and then subtracting a correction for the implicit bit */
+ PUT_BITS_DP64(ux + 0x2590000000000000, x);
+ x -= 9.23297861778573578076e-128; /* 0x2590000000000000 */
+ GET_BITS_DP64(x, ux);
+ }
+ else
+ {
+ /* x is normal - just increase the exponent by 600 */
+ ux += 0x2580000000000000;
+ PUT_BITS_DP64(ux, x);
+ }
+ if (yexp == 0)
+ {
+ PUT_BITS_DP64(uy + 0x2590000000000000, y);
+ y -= 9.23297861778573578076e-128; /* 0x2590000000000000 */
+ GET_BITS_DP64(y, uy);
+ }
+ else
+ {
+ uy += 0x2580000000000000;
+ PUT_BITS_DP64(uy, y);
+ }
+ }
+
+
+#ifdef FAST_BUT_GREATER_THAN_ONE_ULP
+ /* Not awful, but results in accuracy loss larger than 1 ulp */
+  r = x*x + y*y;
+#else
+ /* Slower but more accurate */
+
+ /* Sort so that x is greater than y */
+ if (x < y)
+ {
+ u = y;
+ y = x;
+ x = u;
+ ut = ux;
+ ux = uy;
+ uy = ut;
+ }
+
+ /* Split x into hx and tx, head and tail */
+ PUT_BITS_DP64(ux & 0xfffffffff8000000, hx);
+ tx = x - hx;
+
+ PUT_BITS_DP64(uy & 0xfffffffff8000000, hy);
+ ty = y - hy;
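+  /* The masking keeps the sign, exponent and top 26 significant bits, so
+     hx*hx, hy*hy and hx*hy are exactly representable doubles */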
+
+ /* Compute r = x*x + y*y with extra precision */
+ x2 = x*x;
+ y2 = y*y;
+ hs = x2 + y2;
+
+ if (dexp == 0)
+ /* We take most care when x and y have equal exponents,
+ i.e. are almost the same size */
+ ts = (((x2 - hs) + y2) +
+ ((hx * hx - x2) + 2 * hx * tx) + tx * tx) +
+ ((hy * hy - y2) + 2 * hy * ty) + ty * ty;
+ else
+ ts = (((x2 - hs) + y2) +
+ ((hx * hx - x2) + 2 * hx * tx) + tx * tx);
+
+ r = hs + ts;
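+  /* hs is the rounded sum x2 + y2 and ts collects the rounding errors of the
+     squarings and the addition, so hs + ts is x*x + y*y to nearly full precision */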
+#endif
+
+ /* The sqrt can introduce another half ulp error. */
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_sd(&retval, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&r)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtsd %1, %0" : "=x" (retval) : "x" (r));
+#endif
+
+ /* If necessary scale the result back. This may lead to
+ overflow but if so that's the correct result. */
+ retval = scaleDouble_1(retval, expadjust);
+
+ if (retval > large)
+ /* The result overflowed. Deal with errno. */
+#ifdef WINDOWS
+ return handle_error("hypot", PINFBITPATT_DP64, _OVERFLOW,
+ AMD_F_OVERFLOW | AMD_F_INEXACT, ERANGE, x, y);
+#else
+ return retval_errno_erange_overflow(x, y);
+#endif
+
+ return retval;
+}
+
+weak_alias (__hypot, hypot)
diff --git a/src/hypotf.c b/src/hypotf.c
new file mode 100644
index 0000000..fcc09fc
--- /dev/null
+++ b/src/hypotf.c
@@ -0,0 +1,131 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#ifdef USE_SOFTWARE_SQRT
+#define USE_SQRTF_AMD_INLINE
+#endif
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#ifdef USE_SOFTWARE_SQRT
+#undef USE_SQRTF_AMD_INLINE
+#endif
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline float retval_errno_erange_overflow(float x, float y)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)y;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"hypotf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = HUGE;
+ else
+ exc.retval = infinityf_with_flags(AMD_F_OVERFLOW | AMD_F_INEXACT);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+float FN_PROTOTYPE(hypotf)(float x, float y)
+{
+ /* Returns sqrt(x*x + y*y) with no overflow or underflow unless
+ the result warrants it */
+
+ /* Do intermediate computations in double precision
+ and use sqrt instruction from chip if available. */
+ double dx = x, dy = y, dr, retval;
+
+ /* The largest finite float, stored as a double */
+ const double large = 3.40282346638528859812e+38; /* 0x47efffffe0000000 */
+
+
+ unsigned long long ux, uy, avx, avy;
+
+ GET_BITS_DP64(x, avx);
+ avx &= ~SIGNBIT_DP64;
+ GET_BITS_DP64(y, avy);
+ avy &= ~SIGNBIT_DP64;
+ ux = (avx >> EXPSHIFTBITS_DP64);
+ uy = (avy >> EXPSHIFTBITS_DP64);
+
+ if (ux == BIASEDEMAX_DP64 + 1 || uy == BIASEDEMAX_DP64 + 1)
+ {
+ retval = x*x + y*y;
+ /* One or both of the arguments are NaN or infinity. The
+ result will also be NaN or infinity. */
+ if (((ux == BIASEDEMAX_DP64 + 1) && !(avx & MANTBITS_DP64)) ||
+ ((uy == BIASEDEMAX_DP64 + 1) && !(avy & MANTBITS_DP64)))
+ /* x or y is infinity. ISO C99 defines that we must
+ return +infinity, even if the other argument is NaN.
+ Note that the computation of x*x + y*y above will already
+ have raised invalid if either x or y is a signalling NaN. */
+ return infinityf_with_flags(0);
+ else
+ /* One or both of x or y is NaN, and neither is infinity.
+ Raise invalid if it's a signalling NaN */
+ return (float)retval;
+ }
+
+ dr = (dx*dx + dy*dy);
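+ /* dx and dy are doubles converted from floats, so dx*dx + dy*dy can
+ neither overflow nor underflow in double precision; no argument
+ scaling is needed here, unlike in the double-precision hypot. */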
+
+#ifdef USE_SOFTWARE_SQRT
+ retval = sqrtf_amd_inline(dr);
+#else
+#ifdef WINDOWS
+ /* VC++ intrinsic call */
+ _mm_store_sd(&retval, _mm_sqrt_sd(_mm_setzero_pd(), _mm_load_sd(&dr)));
+#else
+ /* Hammer sqrt instruction */
+ asm volatile ("sqrtsd %1, %0" : "=x" (retval) : "x" (dr));
+#endif
+#endif
+
+ if (retval > large)
+#ifdef WINDOWS
+ return handle_errorf("hypotf", PINFBITPATT_SP32, _OVERFLOW,
+ AMD_F_OVERFLOW | AMD_F_INEXACT, ERANGE, x, y);
+#else
+ return retval_errno_erange_overflow(x, y);
+#endif
+ else
+ return (float)retval;
+ }
+
+weak_alias (__hypotf, hypotf)
diff --git a/src/ilogb.c b/src/ilogb.c
new file mode 100644
index 0000000..2c1cb7c
--- /dev/null
+++ b/src/ilogb.c
@@ -0,0 +1,99 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include <limits.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+int FN_PROTOTYPE(ilogb)(double x)
+{
+
+
+ /* Check for input range */
+ UT64 checkbits;
+ int expbits;
+ U64 manbits;
+ U64 zerovalue;
+ /* Clear the sign bit and check if the value is zero, NaN or Inf. */
+ checkbits.f64=x;
+ zerovalue = (checkbits.u64 & ~SIGNBIT_DP64);
+
+ if(zerovalue == 0)
+ {
+ /* Raise exception as the number is zero */
+ __amd_handle_error(DOMAIN, EDOM, "ilogb", x, 0.0 ,(double)INT_MIN);
+
+
+ return INT_MIN;
+ }
+
+ if( zerovalue == EXPBITS_DP64 )
+ {
+ /* Raise exception as the number is inf */
+
+ __amd_handle_error(DOMAIN, EDOM, "ilogb", x, 0.0 ,(double)INT_MAX);
+
+ return INT_MAX;
+ }
+
+ if( zerovalue > EXPBITS_DP64 )
+ {
+ /* Raise exception as the number is nan */
+ __amd_handle_error(DOMAIN, EDOM, "ilogb", x, 0.0 ,(double)INT_MIN);
+
+
+ return INT_MIN;
+ }
+
+ expbits = (int) (( checkbits.u64 << 1) >> 53);
+
+ if(expbits == 0 && (checkbits.u64 & MANTBITS_DP64 )!= 0)
+ {
+ /* the value is denormalized */
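+ /* Shift the mantissa left until the implicit-bit position
+ (IMPBIT_DP64) is reached, decrementing the exponent from
+ EMIN_DP64 for each shift; the final value of expbits is then the
+ exponent of the leading set bit of the denormal. */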
+ manbits = checkbits.u64 & MANTBITS_DP64;
+ expbits = EMIN_DP64;
+ while (manbits < IMPBIT_DP64)
+ {
+ manbits <<= 1;
+ expbits--;
+ }
+ }
+ else
+ {
+
+ expbits-=EXPBIAS_DP64;
+ }
+
+
+ return expbits;
+}
diff --git a/src/ilogbf.c b/src/ilogbf.c
new file mode 100644
index 0000000..cb129e6
--- /dev/null
+++ b/src/ilogbf.c
@@ -0,0 +1,109 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include <limits.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+int FN_PROTOTYPE(ilogbf)(float x)
+{
+
+ /* Check for input range */
+ UT32 checkbits;
+ int expbits;
+ U32 manbits;
+ U32 zerovalue;
+ checkbits.f32=x;
+
+ /* Clear the sign bit and check if the value is zero, NaN or Inf. */
+ zerovalue = (checkbits.u32 & ~SIGNBIT_SP32);
+
+ if(zerovalue == 0)
+ {
+ /* Raise exception as the number is zero */
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "ilogbf", x, is_x_snan, 0.0f, 0, (float) INT_MIN, 0);
+ }
+
+ return INT_MIN;
+ }
+
+ if( zerovalue == EXPBITS_SP32 )
+ {
+ /* Raise exception as the number is inf */
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "ilogbf", x, is_x_snan, 0.0f, 0, (float) INT_MAX, 0);
+ }
+
+ return INT_MAX;
+ }
+
+ if( zerovalue > EXPBITS_SP32 )
+ {
+ /* Raise exception as the number is nan */
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "ilogbf", x, is_x_snan, 0.0f, 0, (float) INT_MIN, 0);
+ }
+
+ return INT_MIN;
+ }
+
+ expbits = (int) (( checkbits.u32 << 1) >> 24);
+
+ if(expbits == 0 && (checkbits.u32 & MANTBITS_SP32 )!= 0)
+ {
+ /* the value is denormalized */
+ manbits = checkbits.u32 & MANTBITS_SP32;
+ expbits = EMIN_SP32;
+ while (manbits < IMPBIT_SP32)
+ {
+ manbits <<= 1;
+ expbits--;
+ }
+ }
+ else
+ {
+ expbits-=EXPBIAS_SP32;
+ }
+
+
+ return expbits;
+}
diff --git a/src/ldexp.c b/src/ldexp.c
new file mode 100644
index 0000000..695118b
--- /dev/null
+++ b/src/ldexp.c
@@ -0,0 +1,117 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+double FN_PROTOTYPE(ldexp)(double x, int n)
+{
+ UT64 val;
+ unsigned int sign;
+ int exponent;
+ val.f64 = x;
+ sign = val.u32[1] & 0x80000000;
+ val.u32[1] = val.u32[1] & 0x7fffffff; /* remove the sign bit */
+
+ if((val.u32[1] & 0x7ff00000)== 0x7ff00000)/* x= nan or x = +-inf*/
+ return x;
+
+ if((val.u64 == 0x0000000000000000) || (n==0))
+ return x; /* x= +-0 or n= 0*/
+
+ exponent = val.u32[1] >> 20; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f64 = val.f64 * VAL_2PMULTIPLIER_DP;/*multiply by 2^53 to bring it to the normal range*/
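+ /* VAL_2PMULTIPLIER_DP is presumably 2^MULTIPLIER_DP (2^53 per the
+ comment above); MULTIPLIER_DP is subtracted from the exponent below,
+ and the non-overflow path multiplies by VAL_2PMMULTIPLIER_DP
+ (presumably 2^-MULTIPLIER_DP) to undo the scaling. */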
+ exponent = val.u32[1] >> 20; /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_DP;
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+ __amd_handle_error(UNDERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64);
+
+ return val.f64;
+ }
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+ __amd_handle_error(OVERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64);
+
+ return val.f64;
+ }
+
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(UNDERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64);
+
+
+ return val.f64;
+ }
+
+ if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+ __amd_handle_error(OVERFLOW, ERANGE, "ldexp", x,(double)n ,val.f64);
+
+
+ return val.f64;
+ }
+
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ return val.f64;
+}
+
+
+
diff --git a/src/ldexpf.c b/src/ldexpf.c
new file mode 100644
index 0000000..892c6e9
--- /dev/null
+++ b/src/ldexpf.c
@@ -0,0 +1,133 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#endif
+
+#include <math.h>
+#include <errno.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+float FN_PROTOTYPE(ldexpf)(float x, int n)
+{
+ UT32 val;
+ unsigned int sign;
+ int exponent;
+ val.f32 = x;
+ sign = val.u32 & 0x80000000;
+ val.u32 = val.u32 & 0x7fffffff;/* remove the sign bit */
+
+ if((val.u32 & 0x7f800000)== 0x7f800000)/* x= nan or x = +-inf*/
+ return x;
+
+ if((val.u32 == 0x00000000) || (n==0))/* x= +-0 or n= 0*/
+ return x;
+
+ exponent = val.u32 >> 23; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f32 = val.f32 * VAL_2PMULTIPLIER_SP;/*multiply by 2^24 to bring it to the normal range*/
+ exponent = (val.u32 >> 23); /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_SP;
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0);
+ }
+
+
+ return val.f32;
+ }
+
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+ if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "ldexpf", x, is_x_snan, (float)n , 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);/*x is normal and output is normal*/
+ return val.f32;
+}
+
diff --git a/src/libm_special.c b/src/libm_special.c
new file mode 100644
index 0000000..974d99b
--- /dev/null
+++ b/src/libm_special.c
@@ -0,0 +1,117 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+#ifdef __x86_64__
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+#ifdef WIN64
+#define EXCEPTION_S _exception
+#else
+#define EXCEPTION_S exception
+#endif
+
+
+
+static double convert_snan_32to64(float x)
+{
+ U64 t;
+ UT32 xs;
+ UT64 xb;
+
+ xs.f32 = x;
+ xb.u64 = (((xs.u32 & SIGNBIT_SP32) == SIGNBIT_SP32) ? NINFBITPATT_DP64 : EXPBITS_DP64);
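+ /* Start from an infinity bit pattern of matching sign: its exponent
+ field is all ones and its quiet bit is clear, so OR-ing in the
+ shifted float payload below produces a double sNaN with the same
+ sign and payload, without performing an FP conversion that would
+ raise the invalid exception and quiet the NaN. */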
+
+ t = 0;
+ t = (xs.u32 & MANTBITS_SP32);
+ t = (t << 29); // 29 = (52-23)
+ xb.u64 = (xb.u64 | t);
+
+ return xb.f64;
+}
+
+#ifdef NEED_FAKE_MATHERR
+int
+matherr (struct exception *s)
+{
+ return 0;
+}
+#endif
+
+void __amd_handle_errorf(int type, int error, const char *name,
+ float arg1, unsigned int arg1_is_snan,
+ float arg2, unsigned int arg2_is_snan,
+ float retval, unsigned int retval_is_snan)
+{
+ struct EXCEPTION_S exception_data;
+
+ // write exception info
+ exception_data.type = type;
+ exception_data.name = (char*)name;
+
+ // An sNaN float-to-double conversion can trigger an FP exception,
+ // so such arguments are converted specially.
+
+ if(arg1_is_snan) { exception_data.arg1 = convert_snan_32to64(arg1); }
+ else { exception_data.arg1 = (double)arg1; }
+
+ if(arg2_is_snan) { exception_data.arg2 = convert_snan_32to64(arg2); }
+ else { exception_data.arg2 = (double)arg2; }
+
+ if(retval_is_snan) { exception_data.retval = convert_snan_32to64(retval); }
+ else { exception_data.retval = (double)retval; }
+
+ // call matherr, set errno if matherr returns 0
+ if(!matherr(&exception_data))
+ {
+ errno = error;
+ }
+}
+
+void __amd_handle_error(int type, int error, const char *name,
+ double arg1,
+ double arg2,
+ double retval)
+{
+ struct EXCEPTION_S exception_data;
+
+ // write exception info
+ exception_data.type = type;
+ exception_data.name = (char*)name;
+
+ exception_data.arg1 = arg1;
+ exception_data.arg2 = arg2;
+ exception_data.retval = retval;
+
+ // call matherr, set errno if matherr returns 0
+ if(!matherr(&exception_data))
+ {
+ errno = error;
+ }
+}
+
+#endif /* __x86_64__ */
diff --git a/src/llrint.c b/src/llrint.c
new file mode 100644
index 0000000..5f96115
--- /dev/null
+++ b/src/llrint.c
@@ -0,0 +1,62 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+long long int FN_PROTOTYPE(llrint)(double x)
+{
+
+
+ UT64 checkbits,val_2p52;
+ checkbits.f64=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+
+ if( (checkbits.u64 & 0x7FFFFFFFFFFFFFFF) > 0x4330000000000000)
+ {
+ /* The number cannot be rounded: raise an exception. */
+ /* It exceeds the representable range and could also be NaN or Inf. */
+ __amd_handle_error(DOMAIN, EDOM, "llrint", x,0.0 ,(double)x);
+
+ return (long long int) x;
+ }
+
+ val_2p52.u32[1] = (checkbits.u32[1] & 0x80000000) | 0x43300000;
+ val_2p52.u32[0] = 0;
+
+
+ /* Add and sub 2^52 to round the number according to the current rounding direction */
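+ /* Example, with round-to-nearest and x = 2.7: doubles at magnitude
+ 2^52 are spaced 1.0 apart, so x + 2^52 rounds to 2^52 + 3, and
+ subtracting 2^52 leaves exactly 3.0. val_2p52 carries the sign of x,
+ so the same trick works for negative arguments. */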
+
+ return (long long int) ((x + val_2p52.f64) - val_2p52.f64);
+}
diff --git a/src/llrintf.c b/src/llrintf.c
new file mode 100644
index 0000000..509e46b
--- /dev/null
+++ b/src/llrintf.c
@@ -0,0 +1,67 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+long long int FN_PROTOTYPE(llrintf)(float x)
+{
+
+ UT32 checkbits,val_2p23;
+ checkbits.f32=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+
+ if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+ {
+ /* The number cannot be rounded: raise an exception. */
+ /* It exceeds the representable range and could also be NaN or Inf. */
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "llrintf", x, is_x_snan, 0.0F , 0,(float)x, 0);
+ }
+
+ return (long long int) x;
+ }
+
+
+ val_2p23.u32 = (checkbits.u32 & 0x80000000) | 0x4B000000;
+
+ /* Add and sub 2^23 to round the number according to the current rounding direction */
+
+ return (long long int) ((x + val_2p23.f32) - val_2p23.f32);
+}
diff --git a/src/llround.c b/src/llround.c
new file mode 100644
index 0000000..0b582c2
--- /dev/null
+++ b/src/llround.c
@@ -0,0 +1,112 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+/*In Windows long long int is 64-bit and long int is 32-bit.
+ In Linux both long long int and long int are 64-bit. */
+long long int FN_PROTOTYPE(llround)(double d)
+{
+ UT64 u64d;
+ UT64 u64Temp,u64result;
+ int intexp, shift;
+ U64 sign;
+ long long int result;
+
+ u64d.f64 = u64Temp.f64 = d;
+
+ if ((u64d.u32[1] & 0X7FF00000) == 0x7FF00000)
+ {
+ /* The number is NaN or infinity */
+ //Need to raise a range or domain error
+ __amd_handle_error(DOMAIN, EDOM, "llround", d, 0.0 , (double)SIGNBIT_DP64);
+ return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/
+ }
+
+ u64Temp.u32[1] &= 0x7FFFFFFF;
+ intexp = (u64d.u32[1] & 0x7FF00000) >> 20;
+ sign = u64d.u64 & 0x8000000000000000;
+ intexp -= 0x3FF;
+
+ /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+ if (intexp < -1)
+ return (0);
+
+ /* 1.0 x 2^63 is already too large */
+ if (intexp >= 63)
+ {
+ /*The result is out of range: return LLONG_MIN*/
+ result = 0x8000000000000000; /*Return LLONG MIN*/
+ __amd_handle_error(DOMAIN, EDOM, "llround", d, 0.0 , (double) result);
+
+ return result;
+ }
+
+ u64result.f64 = u64Temp.f64;
+ /* >= 2^52 is already an exact integer */
+ if (intexp < 52)
+ {
+ /* add 0.5, extraction below will truncate */
+ u64result.f64 = u64Temp.f64 + 0.5;
+ }
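+ /* Since the sign was cleared from u64Temp above and is reapplied at
+ the end, adding 0.5 and truncating effectively rounds halfway cases
+ away from zero, as llround requires. */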
+
+ intexp = ((u64result.u32[1] >> 20) & 0x7ff) - 0x3FF;
+
+ u64result.u32[1] &= 0xfffff;
+ u64result.u32[1] |= 0x00100000; /*Mask the last exp bit to 1*/
+ shift = intexp - 52;
+
+ if(shift < 0)
+ u64result.u64 = u64result.u64 >> (-shift);
+ if(shift > 0)
+ u64result.u64 = u64result.u64 << (shift);
+
+ result = u64result.u64;
+
+ if (sign)
+ result = -result;
+
+ return result;
+}
+
+#else //WINDOWS
+/*llround is equivalent to the Linux implementation of
+ lround. Both long int and long long int are of the same size*/
+long long int FN_PROTOTYPE(llround)(double d)
+{
+ long long int result;
+ result = FN_PROTOTYPE(lround)(d);
+ return result;
+}
+#endif
diff --git a/src/llroundf.c b/src/llroundf.c
new file mode 100644
index 0000000..0e1ac8a
--- /dev/null
+++ b/src/llroundf.c
@@ -0,0 +1,132 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+/*In Windows long long int is 64-bit and long int is 32-bit.
+ In Linux both long long int and long int are 64-bit. */
+long long int FN_PROTOTYPE(llroundf)(float f)
+{
+ UT32 u32d;
+ UT32 u32Temp,u32result;
+ int intexp, shift;
+ U32 sign;
+ long long int result;
+
+ u32d.f32 = u32Temp.f32 = f;
+ if ((u32d.u32 & 0X7F800000) == 0x7F800000)
+ {
+ /* The number is NaN or infinity */
+ //Need to raise a range or domain error
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "llroundf", f, is_x_snan, 0.0F , 0,(float)SIGNBIT_DP64, 0);
+ return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/
+ }
+
+ }
+
+ u32Temp.u32 &= 0x7FFFFFFF;
+ intexp = (u32d.u32 & 0x7F800000) >> 23;
+ sign = u32d.u32 & 0x80000000;
+ intexp -= 0x7F;
+
+
+ /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+ if (intexp < -1)
+ return (0);
+
+
+ /* 1.0 x 2^63 is already too large */
+ if (intexp >= 63)
+ {
+ result = 0x8000000000000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "llroundf", f, is_x_snan, 0.0F , 0,(float)result, 0);
+ }
+
+ return result;
+ }
+
+ u32result.f32 = u32Temp.f32;
+
+ /* >= 2^23 is already an exact integer */
+ if (intexp < 23)
+ {
+ /* add 0.5, extraction below will truncate */
+ u32result.f32 = u32Temp.f32 + 0.5F;
+ }
+ intexp = (u32result.u32 & 0x7f800000) >> 23;
+ intexp -= 0x7f;
+ u32result.u32 &= 0x7fffff;
+ u32result.u32 |= 0x00800000;
+
+ result = u32result.u32;
+
+ /*Since float is only 32 bits, for higher accuracy we first shift the result
+ * left by 32 bits. The next step then shifts an extra 32 bits in the reverse
+ * direction, based on the value of intexp*/
+ result = result << 32;
+ shift = intexp - 55; /*55= 23 +32*/
+
+
+ if(shift < 0)
+ result = result >> (-shift);
+ if(shift > 0)
+ result = result << (shift);
+
+ if (sign)
+ result = -result;
+ return result;
+
+}
+
+#else //WINDOWS
+/*llroundf is equivalent to the linux implementation of
+ lroundf. Both long int and long long int are of the same size*/
+long long int FN_PROTOTYPE(llroundf)(float f)
+{
+ long long int result;
+ result = FN_PROTOTYPE(lroundf)(f);
+ return result;
+
+}
+#endif
+
diff --git a/src/log1p.c b/src/log1p.c
new file mode 100644
index 0000000..b7cd097
--- /dev/null
+++ b/src/log1p.c
@@ -0,0 +1,475 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NAN_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_INFINITY_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline double retval_errno_erange_overflow(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = SING;
+ exc.name = (char *)"log1p";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = -infinity_with_flags(AMD_F_DIVBYZERO);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+
+/* Deal with errno for out-of-range argument */
+static inline double retval_errno_edom(double x)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"log1p";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = nan_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("log1p: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "log1p"
+
+double FN_PROTOTYPE(log1p)(double x)
+{
+
+ int xexp;
+ double r, r1, r2, correction, f, f1, f2, q, u, v, z1, z2, poly, m2;
+ int index;
+ unsigned long long ux, ax;
+
+ /*
+ Computes natural log(1+x). Algorithm based on:
+ Ping-Tak Peter Tang
+ "Table-driven implementation of the logarithm function in IEEE
+ floating-point arithmetic"
+ ACM Transactions on Mathematical Software (TOMS)
+ Volume 16, Issue 4 (December 1990)
+ Note that we use a lookup table of size 64 rather than 128,
+ and compensate by having extra terms in the minimax polynomial
+ for the kernel approximation.
+ */
+
+/* Arrays ln_lead_table and ln_tail_table contain
+ leading and trailing parts respectively of precomputed
+ values of natural log(1+i/64), for i = 0, 1, ..., 64.
+ ln_lead_table contains the first 24 bits of precision,
+ and ln_tail_table contains a further 53 bits precision. */
+
+ static const double ln_lead_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.55041813850402832031e-02, /* 0x3f8fc0a800000000 */
+ 3.07716131210327148438e-02, /* 0x3f9f829800000000 */
+ 4.58095073699951171875e-02, /* 0x3fa7745800000000 */
+ 6.06245994567871093750e-02, /* 0x3faf0a3000000000 */
+ 7.52233862876892089844e-02, /* 0x3fb341d700000000 */
+ 8.96121263504028320312e-02, /* 0x3fb6f0d200000000 */
+ 1.03796780109405517578e-01, /* 0x3fba926d00000000 */
+ 1.17783010005950927734e-01, /* 0x3fbe270700000000 */
+ 1.31576299667358398438e-01, /* 0x3fc0d77e00000000 */
+ 1.45181953907012939453e-01, /* 0x3fc2955280000000 */
+ 1.58604979515075683594e-01, /* 0x3fc44d2b00000000 */
+ 1.71850204467773437500e-01, /* 0x3fc5ff3000000000 */
+ 1.84922337532043457031e-01, /* 0x3fc7ab8900000000 */
+ 1.97825729846954345703e-01, /* 0x3fc9525a80000000 */
+ 2.10564732551574707031e-01, /* 0x3fcaf3c900000000 */
+ 2.23143517971038818359e-01, /* 0x3fcc8ff780000000 */
+ 2.35566020011901855469e-01, /* 0x3fce270700000000 */
+ 2.47836112976074218750e-01, /* 0x3fcfb91800000000 */
+ 2.59957492351531982422e-01, /* 0x3fd0a324c0000000 */
+ 2.71933674812316894531e-01, /* 0x3fd1675c80000000 */
+ 2.83768117427825927734e-01, /* 0x3fd22941c0000000 */
+ 2.95464158058166503906e-01, /* 0x3fd2e8e280000000 */
+ 3.07025015354156494141e-01, /* 0x3fd3a64c40000000 */
+ 3.18453729152679443359e-01, /* 0x3fd4618bc0000000 */
+ 3.29753279685974121094e-01, /* 0x3fd51aad80000000 */
+ 3.40926527976989746094e-01, /* 0x3fd5d1bd80000000 */
+ 3.51976394653320312500e-01, /* 0x3fd686c800000000 */
+ 3.62905442714691162109e-01, /* 0x3fd739d7c0000000 */
+ 3.73716354370117187500e-01, /* 0x3fd7eaf800000000 */
+ 3.84411692619323730469e-01, /* 0x3fd89a3380000000 */
+ 3.94993782043457031250e-01, /* 0x3fd9479400000000 */
+ 4.05465066432952880859e-01, /* 0x3fd9f323c0000000 */
+ 4.15827870368957519531e-01, /* 0x3fda9cec80000000 */
+ 4.26084339618682861328e-01, /* 0x3fdb44f740000000 */
+ 4.36236739158630371094e-01, /* 0x3fdbeb4d80000000 */
+ 4.46287095546722412109e-01, /* 0x3fdc8ff7c0000000 */
+ 4.56237375736236572266e-01, /* 0x3fdd32fe40000000 */
+ 4.66089725494384765625e-01, /* 0x3fddd46a00000000 */
+ 4.75845873355865478516e-01, /* 0x3fde744240000000 */
+ 4.85507786273956298828e-01, /* 0x3fdf128f40000000 */
+ 4.95077252388000488281e-01, /* 0x3fdfaf5880000000 */
+ 5.04556000232696533203e-01, /* 0x3fe02552a0000000 */
+ 5.13945698738098144531e-01, /* 0x3fe0723e40000000 */
+ 5.23248136043548583984e-01, /* 0x3fe0be72e0000000 */
+ 5.32464742660522460938e-01, /* 0x3fe109f380000000 */
+ 5.41597247123718261719e-01, /* 0x3fe154c3c0000000 */
+ 5.50647079944610595703e-01, /* 0x3fe19ee6a0000000 */
+ 5.59615731239318847656e-01, /* 0x3fe1e85f40000000 */
+ 5.68504691123962402344e-01, /* 0x3fe23130c0000000 */
+ 5.77315330505371093750e-01, /* 0x3fe2795e00000000 */
+ 5.86049020290374755859e-01, /* 0x3fe2c0e9e0000000 */
+ 5.94707071781158447266e-01, /* 0x3fe307d720000000 */
+ 6.03290796279907226562e-01, /* 0x3fe34e2880000000 */
+ 6.11801505088806152344e-01, /* 0x3fe393e0c0000000 */
+ 6.20240390300750732422e-01, /* 0x3fe3d90260000000 */
+ 6.28608644008636474609e-01, /* 0x3fe41d8fe0000000 */
+ 6.36907458305358886719e-01, /* 0x3fe4618bc0000000 */
+ 6.45137906074523925781e-01, /* 0x3fe4a4f840000000 */
+ 6.53301239013671875000e-01, /* 0x3fe4e7d800000000 */
+ 6.61398470401763916016e-01, /* 0x3fe52a2d20000000 */
+ 6.69430613517761230469e-01, /* 0x3fe56bf9c0000000 */
+ 6.77398800849914550781e-01, /* 0x3fe5ad4040000000 */
+ 6.85303986072540283203e-01, /* 0x3fe5ee02a0000000 */
+ 6.93147122859954833984e-01}; /* 0x3fe62e42e0000000 */
+
+ static const double ln_tail_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 5.15092497094772879206e-09, /* 0x3e361f807c79f3db */
+ 4.55457209735272790188e-08, /* 0x3e6873c1980267c8 */
+ 2.86612990859791781788e-08, /* 0x3e5ec65b9f88c69e */
+ 2.23596477332056055352e-08, /* 0x3e58022c54cc2f99 */
+ 3.49498983167142274770e-08, /* 0x3e62c37a3a125330 */
+ 3.23392843005887000414e-08, /* 0x3e615cad69737c93 */
+ 1.35722380472479366661e-08, /* 0x3e4d256ab1b285e9 */
+ 2.56504325268044191098e-08, /* 0x3e5b8abcb97a7aa2 */
+ 5.81213608741512136843e-08, /* 0x3e6f34239659a5dc */
+ 5.59374849578288093334e-08, /* 0x3e6e07fd48d30177 */
+ 5.06615629004996189970e-08, /* 0x3e6b32df4799f4f6 */
+ 5.24588857848400955725e-08, /* 0x3e6c29e4f4f21cf8 */
+ 9.61968535632653505972e-10, /* 0x3e1086c848df1b59 */
+ 1.34829655346594463137e-08, /* 0x3e4cf456b4764130 */
+ 3.65557749306383026498e-08, /* 0x3e63a02ffcb63398 */
+ 3.33431709374069198903e-08, /* 0x3e61e6a6886b0976 */
+ 5.13008650536088382197e-08, /* 0x3e6b8abcb97a7aa2 */
+ 5.09285070380306053751e-08, /* 0x3e6b578f8aa35552 */
+ 3.20853940845502057341e-08, /* 0x3e6139c871afb9fc */
+ 4.06713248643004200446e-08, /* 0x3e65d5d30701ce64 */
+ 5.57028186706125221168e-08, /* 0x3e6de7bcb2d12142 */
+ 5.48356693724804282546e-08, /* 0x3e6d708e984e1664 */
+ 1.99407553679345001938e-08, /* 0x3e556945e9c72f36 */
+ 1.96585517245087232086e-09, /* 0x3e20e2f613e85bda */
+ 6.68649386072067321503e-09, /* 0x3e3cb7e0b42724f6 */
+ 5.89936034642113390002e-08, /* 0x3e6fac04e52846c7 */
+ 2.85038578721554472484e-08, /* 0x3e5e9b14aec442be */
+ 5.09746772910284482606e-08, /* 0x3e6b5de8034e7126 */
+ 5.54234668933210171467e-08, /* 0x3e6dc157e1b259d3 */
+ 6.29100830926604004874e-09, /* 0x3e3b05096ad69c62 */
+ 2.61974119468563937716e-08, /* 0x3e5c2116faba4cdd */
+ 4.16752115011186398935e-08, /* 0x3e665fcc25f95b47 */
+ 2.47747534460820790327e-08, /* 0x3e5a9a08498d4850 */
+ 5.56922172017964209793e-08, /* 0x3e6de647b1465f77 */
+ 2.76162876992552906035e-08, /* 0x3e5da71b7bf7861d */
+ 7.08169709942321478061e-09, /* 0x3e3e6a6886b09760 */
+ 5.77453510221151779025e-08, /* 0x3e6f0075eab0ef64 */
+ 4.43021445893361960146e-09, /* 0x3e33071282fb989b */
+ 3.15140984357495864573e-08, /* 0x3e60eb43c3f1bed2 */
+ 2.95077445089736670973e-08, /* 0x3e5faf06ecb35c84 */
+ 1.44098510263167149349e-08, /* 0x3e4ef1e63db35f68 */
+ 1.05196987538551827693e-08, /* 0x3e469743fb1a71a5 */
+ 5.23641361722697546261e-08, /* 0x3e6c1cdf404e5796 */
+ 7.72099925253243069458e-09, /* 0x3e4094aa0ada625e */
+ 5.62089493829364197156e-08, /* 0x3e6e2d4c96fde3ec */
+ 3.53090261098577946927e-08, /* 0x3e62f4d5e9a98f34 */
+ 3.80080516835568242269e-08, /* 0x3e6467c96ecc5cbe */
+ 5.66961038386146408282e-08, /* 0x3e6e7040d03dec5a */
+ 4.42287063097349852717e-08, /* 0x3e67bebf4282de36 */
+ 3.45294525105681104660e-08, /* 0x3e6289b11aeb783f */
+ 2.47132034530447431509e-08, /* 0x3e5a891d1772f538 */
+ 3.59655343422487209774e-08, /* 0x3e634f10be1fb591 */
+ 5.51581770357780862071e-08, /* 0x3e6d9ce1d316eb93 */
+ 3.60171867511861372793e-08, /* 0x3e63562a19a9c442 */
+ 1.94511067964296180547e-08, /* 0x3e54e2adf548084c */
+ 1.54137376631349347838e-08, /* 0x3e508ce55cc8c97a */
+ 3.93171034490174464173e-09, /* 0x3e30e2f613e85bda */
+ 5.52990607758839766440e-08, /* 0x3e6db03ebb0227bf */
+ 3.29990737637586136511e-08, /* 0x3e61b75bb09cb098 */
+ 1.18436010922446096216e-08, /* 0x3e496f16abb9df22 */
+ 4.04248680368301346709e-08, /* 0x3e65b3f399411c62 */
+ 2.27418915900284316293e-08, /* 0x3e586b3e59f65355 */
+ 1.70263791333409206020e-08, /* 0x3e52482ceae1ac12 */
+ 5.76999904754328540596e-08}; /* 0x3e6efa39ef35793c */
+
+ /* log2_lead and log2_tail sum to an extra-precise version
+ of log(2) */
+ static const double
+ log2_lead = 6.93147122859954833984e-01, /* 0x3fe62e42e0000000 */
+ log2_tail = 5.76999904754328540596e-08; /* 0x3e6efa39ef35793c */
+
+ static const double
+ /* Approximating polynomial coefficients for x near 0.0 */
+ ca_1 = 8.33333333333317923934e-02, /* 0x3fb55555555554e6 */
+ ca_2 = 1.25000000037717509602e-02, /* 0x3f89999999bac6d4 */
+ ca_3 = 2.23213998791944806202e-03, /* 0x3f62492307f1519f */
+ ca_4 = 4.34887777707614552256e-04, /* 0x3f3c8034c85dfff0 */
+
+ /* Approximating polynomial coefficients for other x */
+ cb_1 = 8.33333333333333593622e-02, /* 0x3fb5555555555557 */
+ cb_2 = 1.24999999978138668903e-02, /* 0x3f89999999865ede */
+ cb_3 = 2.23219810758559851206e-03; /* 0x3f6249423bd94741 */
+
+ /* The values exp(-1/16)-1 and exp(1/16)-1 */
+ static const double
+ log1p_thresh1 = -6.05869371865242201114e-02, /* 0xbfaf0540438fd5c4 */
+ log1p_thresh2 = 6.44944589178594318568e-02; /* 0x3fb082b577d34ed8 */
+
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & ~SIGNBIT_DP64;
+
+ if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, ux|0x0008000000000000, _DOMAIN,
+ 0, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity */
+ if (ux & SIGNBIT_DP64)
+ /* x is negative infinity. Return a NaN. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ else
+ return x;
+ }
+ }
+ else if (ux >= 0xbff0000000000000)
+ {
+ /* x <= -1.0 */
+ if (ux > 0xbff0000000000000)
+ {
+ /* x is less than -1.0. Return a NaN. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, INDEFBITPATT_DP64, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+ else
+ {
+ /* x is exactly -1.0. Return -infinity with div-by-zero flag. */
+#ifdef WINDOWS
+ return handle_error(_FUNCNAME, NINFBITPATT_DP64, _SING,
+ AMD_F_DIVBYZERO, ERANGE, x, 0.0);
+#else
+ return retval_errno_erange_overflow(x);
+#endif
+ }
+ }
+ else if (ax < 0x3ca0000000000000)
+ {
+ if (ax == 0x0000000000000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ /* abs(x) is less than epsilon. Return x with inexact. */
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+
+
+ if (x < log1p_thresh1 || x > log1p_thresh2)
+ {
+ /* x is outside the range [exp(-1/16)-1, exp(1/16)-1] */
+ /*
+ First, we decompose the argument x to the form
+ 1 + x = 2**M * (F1 + F2),
+ where 1 <= F1+F2 < 2, M has the value of an integer,
+ F1 = 1 + j/64, j ranges from 0 to 64, and |F2| <= 1/128.
+
+ Second, we approximate log( 1 + F2/F1 ) by an odd polynomial
+ in U, where U = 2 F2 / (2 F1 + F2).
+ Note that log( 1 + F2/F1 ) = log( 1 + U/2 ) - log( 1 - U/2 ).
+ The core approximation calculates
+ Poly = [log( 1 + U/2 ) - log( 1 - U/2 )]/U - 1.
+ Note that log(1 + U/2) - log(1 - U/2) = 2 arctanh ( U/2 ),
+ thus, Poly = 2 arctanh( U/2 ) / U - 1.
+
+ It is not hard to see that
+ log(1+x) = M*log(2) + log(F1) + log( 1 + F2/F1 ).
+ Hence, we return Z1 = log(F1), and Z2 = log( 1 + F2/F1).
+ The values of log(F1) are calculated beforehand and stored
+ in the program.
+ */
+
+ f = 1.0 + x;
+ GET_BITS_DP64(f, ux);
+
+ /* Store the exponent of x in xexp and put
+ f into the range [1.0,2.0) */
+ xexp = (int)((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ PUT_BITS_DP64((ux & MANTBITS_DP64) | ONEEXPBITS_DP64, f);
+
+ /* Now (1+x) = 2**(xexp) * f, 1 <= f < 2. */
+
+ /* Set index to be the nearest integer to 64*f */
+ /* 64 <= index <= 128 */
+ /*
+ r = 64.0 * f;
+ index = (int)(r + 0.5);
+ */
+ /* This code instead of the above can save several cycles.
+ It only works because 64 <= r < 128, so
+ the nearest integer is always contained in exactly
+ 7 bits, and the right shift is always the same. */
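+ /* In detail: the first term extracts bits 52..46 of f (the implicit
+ bit plus the top six mantissa bits), i.e. floor(64*f); the second
+ term adds mantissa bit 45, the "half" bit of 64*f, so the sum is
+ 64*f rounded to the nearest integer. */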
+ index = (int)((((ux & 0x000fc00000000000) | 0x0010000000000000) >> 46)
+ + ((ux & 0x0000200000000000) >> 45));
+
+ f1 = index * 0.015625; /* 0.015625 = 1/64 */
+ index -= 64;
+
+ /* Now take great care to compute f2 such that f1 + f2 = f */
+ if (xexp <= -2 || xexp >= MANTLENGTH_DP64 + 8)
+ {
+ f2 = f - f1;
+ }
+ else
+ {
+ /* Create the number m2 = 2.0^(-xexp) */
+ ux = (unsigned long long)(0x3ff - xexp) << EXPSHIFTBITS_DP64;
+ PUT_BITS_DP64(ux,m2);
+ if (xexp <= MANTLENGTH_DP64 - 1)
+ {
+ f2 = (m2 - f1) + m2*x;
+ }
+ else
+ {
+ f2 = (m2*x - f1) + m2;
+ }
+ }
+
+ /* At this point, x = 2**xexp * ( f1 + f2 ) where
+ f1 = j/64, j = 1, 2, ..., 64 and |f2| <= 1/128. */
+
+ z1 = ln_lead_table[index];
+ q = ln_tail_table[index];
+
+ /* Calculate u = 2 f2 / ( 2 f1 + f2 ) = f2 / ( f1 + 0.5*f2 ) */
+ u = f2 / (f1 + 0.5 * f2);
+
+ /* Here, |u| <= 2(exp(1/16)-1) / (exp(1/16)+1).
+ The core approximation calculates
+ poly = [log(1 + u/2) - log(1 - u/2)]/u - 1 */
+ v = u * u;
+ poly = (v * (cb_1 + v * (cb_2 + v * cb_3)));
+ z2 = q + (u + u * poly);
+
+ /* Now z1,z2 is an extra-precise approximation of log(f). */
+
+ /* Add xexp * log(2) to z1,z2 to get the result log(1+x).
+ The computed r1 is not subject to rounding error because
+ xexp has at most 10 significant bits, log(2) has 24 significant
+ bits, and z1 has up to 24 bits; and the exponents of z1
+ and z2 differ by at most 6. */
+ r1 = (xexp * log2_lead + z1);
+ r2 = (xexp * log2_tail + z2);
+ /* Natural log(1+x) */
+ return r1 + r2;
+ }
+ else
+ {
+ /* Arguments close to 0.0 are handled separately to maintain
+ accuracy.
+
+ The approximation in this region exploits the identity
+ log( 1 + r ) = log( 1 + u/2 ) - log( 1 - u/2 ), where
+ u = 2r / (2+r).
+ Note that the right hand side has an odd Taylor series expansion
+ which converges much faster than the Taylor series expansion of
+ log( 1 + r ) in r. Thus, we approximate log( 1 + r ) by
+ u + A1 * u^3 + A2 * u^5 + ... + An * u^(2n+1).
+
+ One subtlety is that since u cannot be calculated from
+ r exactly, the rounding error in the first u should be
+ avoided if possible. To accomplish this, we observe that
+ u = r - r*r/(2+r).
+ Since x (=r) is the input argument, and thus presumed exact,
+ the formula above approximates u accurately because
+ u = r - correction,
+ and the magnitude of "correction" (of the order of r*r)
+ is small.
+ With these observations, we will approximate log( 1 + r ) by
+ r + ( (A1*u^3 + ... + An*u^(2n+1)) - correction ).
+
+ We approximate log(1+r) by an odd polynomial in u, where
+ u = 2r/(2+r) = r - r*r/(2+r).
+ */
+ r = x;
+ u = r / (2.0 + r);
+ correction = r * u;
+ u = u + u;
+ v = u * u;
+ r1 = r;
+ r2 = (u * v * (ca_1 + v * (ca_2 + v * (ca_3 + v * ca_4))) - correction);
+ return r1 + r2;
+ }
+}
+
+weak_alias (__log1p, log1p)
diff --git a/src/log1pf.c b/src/log1pf.c
new file mode 100644
index 0000000..375a846
--- /dev/null
+++ b/src/log1pf.c
@@ -0,0 +1,416 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NANF_WITH_FLAGS
+#define USE_VALF_WITH_FLAGS
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NANF_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline float retval_errno_erange_overflow(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = SING;
+ exc.name = (char *)"log1pf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = -infinityf_with_flags(AMD_F_DIVBYZERO);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+
+/* Deal with errno for out-of-range argument */
+static inline float retval_errno_edom(float x)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = DOMAIN;
+ exc.name = (char *)"log1pf";
+ if (_LIB_VERSION == _SVID_)
+ exc.retval = -HUGE;
+ else
+ exc.retval = nanf_with_flags(AMD_F_INVALID);
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(EDOM);
+ else if (!matherr(&exc))
+ {
+ if(_LIB_VERSION == _SVID_)
+ (void)fputs("log1pf: DOMAIN error\n", stderr);
+ __set_errno(EDOM);
+ }
+ return exc.retval;
+}
+#endif
+
+#undef _FUNCNAME
+#define _FUNCNAME "log1pf"
+
+float FN_PROTOTYPE(log1pf)(float x)
+{
+
+ int xexp;
+ double dx, r, f, f1, f2, q, u, v, z1, z2, poly, m2;
+ int index;
+ unsigned int ux, ax;
+ unsigned long long lux;
+
+ /*
+ Computes natural log(1+x) for float arguments. Algorithm is
+ basically a promotion of the arguments to double followed
+ by an inlined version of the double algorithm, simplified
+ for efficiency (see log1p_amd.c). Simplifications include:
+ * Special algorithm for arguments near 0.0 not required
+ * Scaling of denormalised arguments not required
+ * Shorter core series approximations used
+ Note that we use a lookup table of size 64 rather than 128,
+ and compensate by having extra terms in the minimax polynomial
+ for the kernel approximation.
+ */
+
+/* Arrays ln_lead_table and ln_tail_table contain
+ leading and trailing parts respectively of precomputed
+ values of natural log(1+i/64), for i = 0, 1, ..., 64.
+ ln_lead_table contains the first 24 bits of precision,
+ and ln_tail_table contains a further 53 bits precision. */
+
+ static const double ln_lead_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.55041813850402832031e-02, /* 0x3f8fc0a800000000 */
+ 3.07716131210327148438e-02, /* 0x3f9f829800000000 */
+ 4.58095073699951171875e-02, /* 0x3fa7745800000000 */
+ 6.06245994567871093750e-02, /* 0x3faf0a3000000000 */
+ 7.52233862876892089844e-02, /* 0x3fb341d700000000 */
+ 8.96121263504028320312e-02, /* 0x3fb6f0d200000000 */
+ 1.03796780109405517578e-01, /* 0x3fba926d00000000 */
+ 1.17783010005950927734e-01, /* 0x3fbe270700000000 */
+ 1.31576299667358398438e-01, /* 0x3fc0d77e00000000 */
+ 1.45181953907012939453e-01, /* 0x3fc2955280000000 */
+ 1.58604979515075683594e-01, /* 0x3fc44d2b00000000 */
+ 1.71850204467773437500e-01, /* 0x3fc5ff3000000000 */
+ 1.84922337532043457031e-01, /* 0x3fc7ab8900000000 */
+ 1.97825729846954345703e-01, /* 0x3fc9525a80000000 */
+ 2.10564732551574707031e-01, /* 0x3fcaf3c900000000 */
+ 2.23143517971038818359e-01, /* 0x3fcc8ff780000000 */
+ 2.35566020011901855469e-01, /* 0x3fce270700000000 */
+ 2.47836112976074218750e-01, /* 0x3fcfb91800000000 */
+ 2.59957492351531982422e-01, /* 0x3fd0a324c0000000 */
+ 2.71933674812316894531e-01, /* 0x3fd1675c80000000 */
+ 2.83768117427825927734e-01, /* 0x3fd22941c0000000 */
+ 2.95464158058166503906e-01, /* 0x3fd2e8e280000000 */
+ 3.07025015354156494141e-01, /* 0x3fd3a64c40000000 */
+ 3.18453729152679443359e-01, /* 0x3fd4618bc0000000 */
+ 3.29753279685974121094e-01, /* 0x3fd51aad80000000 */
+ 3.40926527976989746094e-01, /* 0x3fd5d1bd80000000 */
+ 3.51976394653320312500e-01, /* 0x3fd686c800000000 */
+ 3.62905442714691162109e-01, /* 0x3fd739d7c0000000 */
+ 3.73716354370117187500e-01, /* 0x3fd7eaf800000000 */
+ 3.84411692619323730469e-01, /* 0x3fd89a3380000000 */
+ 3.94993782043457031250e-01, /* 0x3fd9479400000000 */
+ 4.05465066432952880859e-01, /* 0x3fd9f323c0000000 */
+ 4.15827870368957519531e-01, /* 0x3fda9cec80000000 */
+ 4.26084339618682861328e-01, /* 0x3fdb44f740000000 */
+ 4.36236739158630371094e-01, /* 0x3fdbeb4d80000000 */
+ 4.46287095546722412109e-01, /* 0x3fdc8ff7c0000000 */
+ 4.56237375736236572266e-01, /* 0x3fdd32fe40000000 */
+ 4.66089725494384765625e-01, /* 0x3fddd46a00000000 */
+ 4.75845873355865478516e-01, /* 0x3fde744240000000 */
+ 4.85507786273956298828e-01, /* 0x3fdf128f40000000 */
+ 4.95077252388000488281e-01, /* 0x3fdfaf5880000000 */
+ 5.04556000232696533203e-01, /* 0x3fe02552a0000000 */
+ 5.13945698738098144531e-01, /* 0x3fe0723e40000000 */
+ 5.23248136043548583984e-01, /* 0x3fe0be72e0000000 */
+ 5.32464742660522460938e-01, /* 0x3fe109f380000000 */
+ 5.41597247123718261719e-01, /* 0x3fe154c3c0000000 */
+ 5.50647079944610595703e-01, /* 0x3fe19ee6a0000000 */
+ 5.59615731239318847656e-01, /* 0x3fe1e85f40000000 */
+ 5.68504691123962402344e-01, /* 0x3fe23130c0000000 */
+ 5.77315330505371093750e-01, /* 0x3fe2795e00000000 */
+ 5.86049020290374755859e-01, /* 0x3fe2c0e9e0000000 */
+ 5.94707071781158447266e-01, /* 0x3fe307d720000000 */
+ 6.03290796279907226562e-01, /* 0x3fe34e2880000000 */
+ 6.11801505088806152344e-01, /* 0x3fe393e0c0000000 */
+ 6.20240390300750732422e-01, /* 0x3fe3d90260000000 */
+ 6.28608644008636474609e-01, /* 0x3fe41d8fe0000000 */
+ 6.36907458305358886719e-01, /* 0x3fe4618bc0000000 */
+ 6.45137906074523925781e-01, /* 0x3fe4a4f840000000 */
+ 6.53301239013671875000e-01, /* 0x3fe4e7d800000000 */
+ 6.61398470401763916016e-01, /* 0x3fe52a2d20000000 */
+ 6.69430613517761230469e-01, /* 0x3fe56bf9c0000000 */
+ 6.77398800849914550781e-01, /* 0x3fe5ad4040000000 */
+ 6.85303986072540283203e-01, /* 0x3fe5ee02a0000000 */
+ 6.93147122859954833984e-01}; /* 0x3fe62e42e0000000 */
+
+ static const double ln_tail_table[65] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 5.15092497094772879206e-09, /* 0x3e361f807c79f3db */
+ 4.55457209735272790188e-08, /* 0x3e6873c1980267c8 */
+ 2.86612990859791781788e-08, /* 0x3e5ec65b9f88c69e */
+ 2.23596477332056055352e-08, /* 0x3e58022c54cc2f99 */
+ 3.49498983167142274770e-08, /* 0x3e62c37a3a125330 */
+ 3.23392843005887000414e-08, /* 0x3e615cad69737c93 */
+ 1.35722380472479366661e-08, /* 0x3e4d256ab1b285e9 */
+ 2.56504325268044191098e-08, /* 0x3e5b8abcb97a7aa2 */
+ 5.81213608741512136843e-08, /* 0x3e6f34239659a5dc */
+ 5.59374849578288093334e-08, /* 0x3e6e07fd48d30177 */
+ 5.06615629004996189970e-08, /* 0x3e6b32df4799f4f6 */
+ 5.24588857848400955725e-08, /* 0x3e6c29e4f4f21cf8 */
+ 9.61968535632653505972e-10, /* 0x3e1086c848df1b59 */
+ 1.34829655346594463137e-08, /* 0x3e4cf456b4764130 */
+ 3.65557749306383026498e-08, /* 0x3e63a02ffcb63398 */
+ 3.33431709374069198903e-08, /* 0x3e61e6a6886b0976 */
+ 5.13008650536088382197e-08, /* 0x3e6b8abcb97a7aa2 */
+ 5.09285070380306053751e-08, /* 0x3e6b578f8aa35552 */
+ 3.20853940845502057341e-08, /* 0x3e6139c871afb9fc */
+ 4.06713248643004200446e-08, /* 0x3e65d5d30701ce64 */
+ 5.57028186706125221168e-08, /* 0x3e6de7bcb2d12142 */
+ 5.48356693724804282546e-08, /* 0x3e6d708e984e1664 */
+ 1.99407553679345001938e-08, /* 0x3e556945e9c72f36 */
+ 1.96585517245087232086e-09, /* 0x3e20e2f613e85bda */
+ 6.68649386072067321503e-09, /* 0x3e3cb7e0b42724f6 */
+ 5.89936034642113390002e-08, /* 0x3e6fac04e52846c7 */
+ 2.85038578721554472484e-08, /* 0x3e5e9b14aec442be */
+ 5.09746772910284482606e-08, /* 0x3e6b5de8034e7126 */
+ 5.54234668933210171467e-08, /* 0x3e6dc157e1b259d3 */
+ 6.29100830926604004874e-09, /* 0x3e3b05096ad69c62 */
+ 2.61974119468563937716e-08, /* 0x3e5c2116faba4cdd */
+ 4.16752115011186398935e-08, /* 0x3e665fcc25f95b47 */
+ 2.47747534460820790327e-08, /* 0x3e5a9a08498d4850 */
+ 5.56922172017964209793e-08, /* 0x3e6de647b1465f77 */
+ 2.76162876992552906035e-08, /* 0x3e5da71b7bf7861d */
+ 7.08169709942321478061e-09, /* 0x3e3e6a6886b09760 */
+ 5.77453510221151779025e-08, /* 0x3e6f0075eab0ef64 */
+ 4.43021445893361960146e-09, /* 0x3e33071282fb989b */
+ 3.15140984357495864573e-08, /* 0x3e60eb43c3f1bed2 */
+ 2.95077445089736670973e-08, /* 0x3e5faf06ecb35c84 */
+ 1.44098510263167149349e-08, /* 0x3e4ef1e63db35f68 */
+ 1.05196987538551827693e-08, /* 0x3e469743fb1a71a5 */
+ 5.23641361722697546261e-08, /* 0x3e6c1cdf404e5796 */
+ 7.72099925253243069458e-09, /* 0x3e4094aa0ada625e */
+ 5.62089493829364197156e-08, /* 0x3e6e2d4c96fde3ec */
+ 3.53090261098577946927e-08, /* 0x3e62f4d5e9a98f34 */
+ 3.80080516835568242269e-08, /* 0x3e6467c96ecc5cbe */
+ 5.66961038386146408282e-08, /* 0x3e6e7040d03dec5a */
+ 4.42287063097349852717e-08, /* 0x3e67bebf4282de36 */
+ 3.45294525105681104660e-08, /* 0x3e6289b11aeb783f */
+ 2.47132034530447431509e-08, /* 0x3e5a891d1772f538 */
+ 3.59655343422487209774e-08, /* 0x3e634f10be1fb591 */
+ 5.51581770357780862071e-08, /* 0x3e6d9ce1d316eb93 */
+ 3.60171867511861372793e-08, /* 0x3e63562a19a9c442 */
+ 1.94511067964296180547e-08, /* 0x3e54e2adf548084c */
+ 1.54137376631349347838e-08, /* 0x3e508ce55cc8c97a */
+ 3.93171034490174464173e-09, /* 0x3e30e2f613e85bda */
+ 5.52990607758839766440e-08, /* 0x3e6db03ebb0227bf */
+ 3.29990737637586136511e-08, /* 0x3e61b75bb09cb098 */
+ 1.18436010922446096216e-08, /* 0x3e496f16abb9df22 */
+ 4.04248680368301346709e-08, /* 0x3e65b3f399411c62 */
+ 2.27418915900284316293e-08, /* 0x3e586b3e59f65355 */
+ 1.70263791333409206020e-08, /* 0x3e52482ceae1ac12 */
+ 5.76999904754328540596e-08}; /* 0x3e6efa39ef35793c */
+
+ static const double
+ log2 = 6.931471805599453e-01, /* 0x3fe62e42fefa39ef */
+
+ /* Approximating polynomial coefficients */
+ cb_1 = 8.33333333333333593622e-02, /* 0x3fb5555555555557 */
+ cb_2 = 1.24999999978138668903e-02; /* 0x3f89999999865ede */
+
+ GET_BITS_SP32(x, ux);
+ ax = ux & ~SIGNBIT_SP32;
+
+ if ((ux & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_SP32)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, ux|0x00400000, _DOMAIN,
+ 0, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity */
+ if (ux & SIGNBIT_SP32)
+ {
+ /* x is negative infinity. Return a NaN. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+ else
+ return x;
+ }
+ }
+ else if (ux >= 0xbf800000)
+ {
+ /* x <= -1.0 */
+ if (ux > 0xbf800000)
+ {
+ /* x is less than -1.0. Return a NaN. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, INDEFBITPATT_SP32, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return retval_errno_edom(x);
+#endif
+ }
+ else
+ {
+ /* x is exactly -1.0. Return -infinity with div-by-zero flag. */
+#ifdef WINDOWS
+ return handle_errorf(_FUNCNAME, NINFBITPATT_SP32, _SING,
+ AMD_F_DIVBYZERO, ERANGE, x, 0.0F);
+#else
+ return retval_errno_erange_overflow(x);
+#endif
+ }
+ }
+ else if (ax < 0x33800000)
+ {
+ if (ax == 0x00000000)
+ {
+ /* x is +/-zero. Return the same zero. */
+ return x;
+ }
+ else
+ /* abs(x) is less than float epsilon. Return x with inexact. */
+ return valf_with_flags(x, AMD_F_INEXACT);
+ }
+
+ dx = x;
+ /*
+ First, we decompose the argument dx to the form
+ 1 + dx = 2**M * (F1 + F2),
+ where 1 <= F1+F2 < 2, M has the value of an integer,
+ F1 = 1 + j/64, j ranges from 0 to 64, and |F2| <= 1/128.
+
+ Second, we approximate log( 1 + F2/F1 ) by an odd polynomial
+      in U, where U  =  2 F2 / (2 F1 + F2).
+ Note that log( 1 + F2/F1 ) = log( 1 + U/2 ) - log( 1 - U/2 ).
+ The core approximation calculates
+ Poly = [log( 1 + U/2 ) - log( 1 - U/2 )]/U - 1.
+ Note that log(1 + U/2) - log(1 - U/2) = 2 arctanh ( U/2 ),
+ thus, Poly = 2 arctanh( U/2 ) / U - 1.
+
+ It is not hard to see that
+ log(dx) = M*log(2) + log(F1) + log( 1 + F2/F1 ).
+ Hence, we return Z1 = log(F1), and Z2 = log( 1 + F2/F1).
+ The values of log(F1) are calculated beforehand and stored
+ in the program.
+ */
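+  /* Illustrative example (not part of the original notes): for dx = 0.3,
+     1 + dx = 1.3 = 2**0 * (1.296875 + 0.003125), so M = 0, F1 = 83/64 and
+     F2 = 0.003125, which indeed satisfies |F2| <= 1/128. */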
+
+ f = 1.0 + dx;
+ GET_BITS_DP64(f, lux);
+
+ /* Store the exponent of f = 1 + dx in xexp and put
+ f into the range [1.0,2.0) */
+ xexp = (int)((lux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ PUT_BITS_DP64((lux & MANTBITS_DP64) | ONEEXPBITS_DP64, f);
+
+ /* Now (1+dx) = 2**(xexp) * f, 1 <= f < 2. */
+
+ /* Set index to be the nearest integer to 64*f */
+ /* 64 <= index <= 128 */
+ /*
+ r = 64.0 * f;
+ index = (int)(r + 0.5);
+ */
+  /* This code, used instead of the lines above, saves several cycles.
+ It only works because 64 <= r < 128, so
+ the nearest integer is always contained in exactly
+ 7 bits, and the right shift is always the same. */
+ index = (int)((((lux & 0x000fc00000000000) | 0x0010000000000000) >> 46)
+ + ((lux & 0x0000200000000000) >> 45));
+
+ f1 = index * 0.015625; /* 0.015625 = 1/64 */
+ index -= 64;
+
+ /* Now take great care to compute f2 such that f1 + f2 = f */
+ if (xexp <= -2 || xexp >= MANTLENGTH_DP64 + 8)
+ {
+ f2 = f - f1;
+ }
+ else
+ {
+ /* Create the number m2 = 2.0^(-xexp) */
+ lux = (unsigned long long)(0x3ff - xexp) << EXPSHIFTBITS_DP64;
+ PUT_BITS_DP64(lux,m2);
+ if (xexp <= MANTLENGTH_DP64 - 1)
+ {
+ f2 = (m2 - f1) + m2*dx;
+ }
+ else
+ {
+ f2 = (m2*dx - f1) + m2;
+ }
+ }
+
+  /* At this point, 1 + dx = 2**xexp * ( f1 + f2 ) where
+     f1 = 1 + j/64, j = 0, 1, ..., 64 and |f2| <= 1/128. */
+
+ z1 = ln_lead_table[index];
+ q = ln_tail_table[index];
+
+ /* Calculate u = 2 f2 / ( 2 f1 + f2 ) = f2 / ( f1 + 0.5*f2 ) */
+ u = f2 / (f1 + 0.5 * f2);
+
+ /* Here, |u| <= 2(exp(1/16)-1) / (exp(1/16)+1).
+ The core approximation calculates
+ poly = [log(1 + u/2) - log(1 - u/2)]/u - 1 */
+ v = u * u;
+ poly = (v * (cb_1 + v * cb_2));
+ z2 = q + (u + u * poly);
+
+ /* Now z1,z2 is an extra-precise approximation of log(f). */
+
+ /* Add xexp * log(2) to z1,z2 to get the result log(1+x). */
+ r = xexp * log2 + z1 + z2;
+ /* Natural log(1+x) */
+ return (float)r;
+}
+
+weak_alias (__log1pf, log1pf)
diff --git a/src/log_special.c b/src/log_special.c
new file mode 100644
index 0000000..53a92b8
--- /dev/null
+++ b/src/log_special.c
@@ -0,0 +1,141 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+#ifdef __x86_64__
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+// y = log10f(x)
+// y = log10(x)
+// y = logf(x)
+// y = log(x)
+
+// these codes and the ones in the related .S or .asm files have to match
+#define LOG_X_ZERO 1
+#define LOG_X_NEG 2
+#define LOG_X_NAN 3
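+
+// Illustrative assumption: the assembly kernels are expected to branch into
+// these handlers with the result value they have already computed, e.g.
+// _logf_special(x, y, LOG_X_NEG) for a negative input; the actual call sites
+// live in the corresponding .S/.asm sources.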
+
+static float _logf_special_common(float x, float y, U32 code, const char *name)
+{
+ switch(code)
+ {
+ case LOG_X_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO);
+ __amd_handle_errorf(SING, ERANGE, name, x, 0, 0.0f, 0, y, 0);
+ }
+ break;
+
+ case LOG_X_NEG:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_errorf(DOMAIN, EDOM, name, x, 0, 0.0f, 0, y, 0);
+ }
+ break;
+
+ case LOG_X_NAN:
+ {
+#ifdef WIN64
+ // y is assumed to be qnan, only check x for snan
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, name, x, is_x_snan, 0.0f, 0, y, 0);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+ }
+
+ return y;
+}
+
+float _logf_special(float x, float y, U32 code)
+{
+ return _logf_special_common(x, y, code, "logf");
+}
+
+float _log10f_special(float x, float y, U32 code)
+{
+ return _logf_special_common(x, y, code, "log10f");
+}
+
+float _log2f_special(float x, float y, U32 code)
+{
+ return _logf_special_common(x, y, code, "log2f");
+}
+
+static double _log_special_common(double x, double y, U32 code,
+ const char *name)
+{
+ switch(code)
+ {
+ case LOG_X_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO);
+ __amd_handle_error(SING, ERANGE, name, x, 0.0, y);
+ }
+ break;
+
+ case LOG_X_NEG:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_error(DOMAIN, EDOM, name, x, 0.0, y);
+ }
+ break;
+
+ case LOG_X_NAN:
+ {
+#ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, name, x, 0.0, y);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+ }
+
+ return y;
+}
+
+double _log_special(double x, double y, U32 code)
+{
+ return _log_special_common(x, y, code, "log");
+}
+
+double _log10_special(double x, double y, U32 code)
+{
+ return _log_special_common(x, y, code, "log10");
+}
+
+double _log2_special(double x, double y, U32 code)
+{
+ return _log_special_common(x, y, code, "log2");
+}
+
+#endif /* __x86_64__ */
diff --git a/src/logb.c b/src/logb.c
new file mode 100644
index 0000000..7c75ef1
--- /dev/null
+++ b/src/logb.c
@@ -0,0 +1,102 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_INFINITY_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#endif
+
+#ifdef WINDOWS
+double FN_PROTOTYPE(logb)(double x)
+#else
+double FN_PROTOTYPE(logb)(double x)
+#endif
+{
+
+ unsigned long long ux;
+ long long u;
+ GET_BITS_DP64(x, ux);
+ u = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ if ((ux & ~SIGNBIT_DP64) == 0)
+ /* x is +/-zero. Return -infinity with div-by-zero flag. */
+#ifdef WINDOWS
+ return handle_error("logb", NINFBITPATT_DP64, _SING,
+ AMD_F_DIVBYZERO, ERANGE, x, 0.0);
+#else
+ return -infinity_with_flags(AMD_F_DIVBYZERO);
+#endif
+ else if (EMIN_DP64 <= u && u <= EMAX_DP64)
+ /* x is a normal number */
+ return (double)u;
+ else if (u > EMAX_DP64)
+ {
+ /* x is infinity or NaN */
+ if ((ux & MANTBITS_DP64) == 0)
+#ifdef WINDOWS
+ /* x is +/-infinity. For VC++, return infinity of same sign. */
+ return x;
+#else
+ /* x is +/-infinity. Return +infinity with no flags. */
+ return infinity_with_flags(0);
+#endif
+ else
+ /* x is NaN, result is NaN */
+#ifdef WINDOWS
+ return handle_error("logb", ux|0x0008000000000000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is denormalized. */
+#ifdef FOLLOW_IEEE754_LOGB
+ /* Return the value of the minimum exponent to ensure that
+ the relationship between logb and scalb, defined in
+ IEEE 754, holds. */
+ return EMIN_DP64;
+#else
+ /* Follow the rule set by IEEE 854 for logb */
+ ux &= MANTBITS_DP64;
+ u = EMIN_DP64;
+ while (ux < IMPBIT_DP64)
+ {
+ ux <<= 1;
+ u--;
+ }
+ return (double)u;
+#endif
+ }
+
+}
+
+weak_alias (__logb, logb)
diff --git a/src/logbf.c b/src/logbf.c
new file mode 100644
index 0000000..d64e531
--- /dev/null
+++ b/src/logbf.c
@@ -0,0 +1,100 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_INFINITYF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_INFINITYF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#endif
+
+#ifdef WINDOWS
+float FN_PROTOTYPE(logbf)(float x)
+#else
+float FN_PROTOTYPE(logbf)(float x)
+#endif
+{
+ unsigned int ux;
+ int u;
+ GET_BITS_SP32(x, ux);
+ u = ((ux & EXPBITS_SP32) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+ if ((ux & ~SIGNBIT_SP32) == 0)
+ /* x is +/-zero. Return -infinity with div-by-zero flag. */
+#ifdef WINDOWS
+ return handle_errorf("logbf", NINFBITPATT_SP32, _SING,
+ AMD_F_DIVBYZERO, ERANGE, x, 0.0F);
+#else
+ return -infinityf_with_flags(AMD_F_DIVBYZERO);
+#endif
+ else if (EMIN_SP32 <= u && u <= EMAX_SP32)
+ /* x is a normal number */
+ return (float)u;
+ else if (u > EMAX_SP32)
+ {
+ /* x is infinity or NaN */
+ if ((ux & MANTBITS_SP32) == 0)
+#ifdef WINDOWS
+ /* x is +/-infinity. For VC++, return infinity of same sign. */
+ return x;
+#else
+ /* x is +/-infinity. Return +infinity with no flags. */
+ return infinityf_with_flags(0);
+#endif
+ else
+ /* x is NaN, result is NaN */
+#ifdef WINDOWS
+ return handle_errorf("logbf", ux|0x00400000, _DOMAIN,
+ AMD_F_INVALID, EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is denormalized. */
+#ifdef FOLLOW_IEEE754_LOGB
+ /* Return the value of the minimum exponent to ensure that
+ the relationship between logb and scalb, defined in
+ IEEE 754, holds. */
+ return EMIN_SP32;
+#else
+ /* Follow the rule set by IEEE 854 for logb */
+ ux &= MANTBITS_SP32;
+ u = EMIN_SP32;
+ while (ux < IMPBIT_SP32)
+ {
+ ux <<= 1;
+ u--;
+ }
+ return (float)u;
+#endif
+ }
+}
+
+weak_alias (__logbf, logbf)
diff --git a/src/lrint.c b/src/lrint.c
new file mode 100644
index 0000000..e3c0e41
--- /dev/null
+++ b/src/lrint.c
@@ -0,0 +1,62 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+long int FN_PROTOTYPE(lrint)(double x)
+{
+
+ UT64 checkbits,val_2p52;
+ checkbits.f64=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+
+ if( (checkbits.u64 & 0x7FFFFFFFFFFFFFFF) > 0x4330000000000000)
+ {
+        /* The number cannot be rounded here: it exceeds the representable
+           range, or it is NaN or infinity. Raise an exception. */
+ __amd_handle_error(DOMAIN, EDOM, "lrint", x,0.0 ,(double)x);
+
+
+ return (long int) x;
+ }
+
+ val_2p52.u32[1] = (checkbits.u32[1] & 0x80000000) | 0x43300000;
+ val_2p52.u32[0] = 0;
+
+    /* Add and subtract 2^52 to round the number according to the current rounding mode */
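+    /* Adding 2^52 (with the sign of x) pushes the fractional bits out of the
+       53-bit significand, so the FPU rounds to an integer in the current
+       rounding mode; subtracting it back leaves that rounded integer. */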
+
+ return (long int) ((x + val_2p52.f64) - val_2p52.f64);
+}
diff --git a/src/lrintf.c b/src/lrintf.c
new file mode 100644
index 0000000..abcd37b
--- /dev/null
+++ b/src/lrintf.c
@@ -0,0 +1,67 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+long int FN_PROTOTYPE(lrintf)(float x)
+{
+
+ UT32 checkbits,val_2p23;
+ checkbits.f32=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+
+ if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+ {
+        /* The number cannot be rounded here: it exceeds the representable
+           range, or it is NaN or infinity. Raise an exception. */
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "lrintf", x, is_x_snan, 0.0F , 0,(float)x, 0);
+ }
+
+ return (long int) x;
+ }
+
+
+ val_2p23.u32 = (checkbits.u32 & 0x80000000) | 0x4B000000;
+
+    /* Add and subtract 2^23 to round the number according to the current rounding mode */
+
+ return (long int) ((x + val_2p23.f32) - val_2p23.f32);
+}
diff --git a/src/lround.c b/src/lround.c
new file mode 100644
index 0000000..dfe411d
--- /dev/null
+++ b/src/lround.c
@@ -0,0 +1,135 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+long int FN_PROTOTYPE(lround)(double d)
+{
+ UT64 u64d;
+ UT64 u64Temp,u64result;
+ int intexp, shift;
+ U64 sign;
+ long int result;
+
+ u64d.f64 = u64Temp.f64 = d;
+
+ if ((u64d.u32[1] & 0X7FF00000) == 0x7FF00000)
+ {
+        /* The input is NaN or infinity: raise a domain error. */
+ #ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double)SIGNBIT_SP32);
+ return (long int )SIGNBIT_SP32;
+ #else
+ __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double)SIGNBIT_DP64);
+ return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/
+ #endif
+
+ }
+
+ u64Temp.u32[1] &= 0x7FFFFFFF;
+ intexp = (u64d.u32[1] & 0x7FF00000) >> 20;
+ sign = u64d.u64 & 0x8000000000000000;
+ intexp -= 0x3FF;
+
+ /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+ if (intexp < -1)
+ return (0);
+
+#ifdef WIN64
+	/* 1.0 x 2^31 is already too large for a 32-bit long */
+ if (intexp >= 31)
+ {
+        /* The value does not fit in a long; return LONG_MIN */
+ result = 0x80000000; /*Return LONG MIN*/
+
+ __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double) result);
+
+ return result;
+ }
+
+
+#else
+    /* 1.0 x 2^63 is already too large for a 64-bit long */
+ if (intexp >= 63)
+ {
+        /* The value does not fit in a long; return LONG_MIN */
+ result = 0x8000000000000000; /*Return LONG MIN*/
+
+ __amd_handle_error(DOMAIN, EDOM, "lround", d, 0.0 , (double) result);
+
+ return result;
+ }
+
+#endif
+
+ u64result.f64 = u64Temp.f64;
+ /* >= 2^52 is already an exact integer */
+    if (intexp < 52)
+ {
+ /* add 0.5, extraction below will truncate */
+ u64result.f64 = u64Temp.f64 + 0.5;
+ }
+
+ intexp = ((u64result.u32[1] >> 20) & 0x7ff) - 0x3FF;
+
+ u64result.u32[1] &= 0xfffff;
+    u64result.u32[1] |= 0x00100000; /* Set the implicit integer bit (bit 52) */
+ shift = intexp - 52;
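+    /* u64result now holds the significand with the implicit bit at position
+       52, representing mant * 2^(intexp - 52); shifting by |shift| in the
+       appropriate direction therefore yields the integer result. */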
+
+#ifdef WIN64
+ /*The shift value will always be negative.*/
+ u64result.u64 = u64result.u64 >> (-shift);
+    /* After the shift the integer result fits in the low 32-bit word */
+ result = u64result.u32[0];
+#else
+ if(shift < 0)
+ u64result.u64 = u64result.u64 >> (-shift);
+ if(shift > 0)
+ u64result.u64 = u64result.u64 << (shift);
+
+ result = u64result.u64;
+#endif
+
+
+
+ if (sign)
+ result = -result;
+
+ return result;
+}
+
diff --git a/src/lroundf.c b/src/lroundf.c
new file mode 100644
index 0000000..799e960
--- /dev/null
+++ b/src/lroundf.c
@@ -0,0 +1,147 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+long int FN_PROTOTYPE(lroundf)(float f)
+{
+ UT32 u32d;
+ UT32 u32Temp,u32result;
+ int intexp, shift;
+ U32 sign;
+ long int result;
+
+ u32d.f32 = u32Temp.f32 = f;
+ if ((u32d.u32 & 0X7F800000) == 0x7F800000)
+ {
+        /* The input is NaN or infinity: raise a domain error. */
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ #ifdef WIN64
+ __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)SIGNBIT_SP32, 0);
+ return (long int)SIGNBIT_SP32;
+ #else
+ __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)SIGNBIT_DP64, 0);
+ return SIGNBIT_DP64; /*GCC returns this when the number is out of range*/
+ #endif
+ }
+
+ }
+
+ u32Temp.u32 &= 0x7FFFFFFF;
+ intexp = (u32d.u32 & 0x7F800000) >> 23;
+ sign = u32d.u32 & 0x80000000;
+ intexp -= 0x7F;
+
+
+ /* 1.0 x 2^-1 is the smallest number which can be rounded to 1 */
+ if (intexp < -1)
+ return (0);
+
+
+#ifdef WIN64
+ /* 1.0 x 2^31 is already too large */
+ if (intexp >= 31)
+ {
+ result = 0x80000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)result, 0);
+ }
+
+ return result;
+ }
+
+#else
+    /* 1.0 x 2^63 is already too large for a 64-bit long */
+ if (intexp >= 63)
+ {
+ result = 0x8000000000000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "lroundf", f, is_x_snan, 0.0F , 0,(float)result, 0);
+ }
+
+ return result;
+ }
+ #endif
+
+ u32result.f32 = u32Temp.f32;
+
+ /* >= 2^23 is already an exact integer */
+ if (intexp < 23)
+ {
+ /* add 0.5, extraction below will truncate */
+ u32result.f32 = u32Temp.f32 + 0.5F;
+ }
+ intexp = (u32result.u32 & 0x7f800000) >> 23;
+ intexp -= 0x7f;
+ u32result.u32 &= 0x7fffff;
+ u32result.u32 |= 0x00800000;
+
+ result = u32result.u32;
+
+ #ifdef WIN64
+ shift = intexp - 23;
+ #else
+
+	/* Since float is only 32 bits wide, for higher accuracy shift the result
+	 * up by 32 bits here; the shift below then moves it back down by an extra
+	 * 32 bits in the reverse direction, based on intexp (55 = 23 + 32). */
+ result = result << 32;
+ shift = intexp - 55; /*55= 23 +32*/
+ #endif
+
+
+ if(shift < 0)
+ result = result >> (-shift);
+ if(shift > 0)
+ result = result << (shift);
+
+ if (sign)
+ result = -result;
+ return result;
+
+}
+
+
+
diff --git a/src/modf.c b/src/modf.c
new file mode 100644
index 0000000..836db46
--- /dev/null
+++ b/src/modf.c
@@ -0,0 +1,80 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+double FN_PROTOTYPE(modf)(double x, double *iptr)
+{
+ /* modf splits the argument x into integer and fraction parts,
+ each with the same sign as x. */
+
+
+ long long xexp;
+ unsigned long long ux, ax, mask;
+
+ GET_BITS_DP64(x, ux);
+ ax = ux & (~SIGNBIT_DP64);
+
+ if (ax >= 0x4340000000000000)
+ {
+ /* abs(x) is either NaN, infinity, or >= 2^53 */
+ if (ax > 0x7ff0000000000000)
+ {
+ /* x is NaN */
+ *iptr = x;
+ return x + x; /* Raise invalid if it is a signalling NaN */
+ }
+ else
+ {
+ /* x is infinity or large. Return zero with the sign of x */
+ *iptr = x;
+ PUT_BITS_DP64(ux & SIGNBIT_DP64, x);
+ return x;
+ }
+ }
+ else if (ax < 0x3ff0000000000000)
+ {
+ /* abs(x) < 1.0. Set iptr to zero with the sign of x
+ and return x. */
+ PUT_BITS_DP64(ux & SIGNBIT_DP64, *iptr);
+ return x;
+ }
+ else
+ {
+ double r;
+ unsigned long long ur;
+ xexp = ((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64;
+ /* Mask out the bits of x that we don't want */
+ mask = 1;
+ mask = (mask << (EXPSHIFTBITS_DP64 - xexp)) - 1;
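+      /* The low (52 - xexp) mantissa bits of x hold the fractional part, so
+         clearing them leaves the integral part in *iptr. */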
+ PUT_BITS_DP64(ux & ~mask, *iptr);
+ r = x - *iptr;
+ GET_BITS_DP64(r, ur);
+ PUT_BITS_DP64(((ux & SIGNBIT_DP64)|ur), r);
+ return r;
+ }
+
+}
+
+weak_alias (__modf, modf)
diff --git a/src/modff.c b/src/modff.c
new file mode 100644
index 0000000..7e5eae7
--- /dev/null
+++ b/src/modff.c
@@ -0,0 +1,74 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+float FN_PROTOTYPE(modff)(float x, float *iptr)
+{
+ /* modff splits the argument x into integer and fraction parts,
+ each with the same sign as x. */
+
+ unsigned int ux, mask;
+ int xexp;
+
+ GET_BITS_SP32(x, ux);
+ xexp = ((ux & (~SIGNBIT_SP32)) >> EXPSHIFTBITS_SP32) - EXPBIAS_SP32;
+
+ if (xexp < 0)
+ {
+ /* abs(x) < 1.0. Set iptr to zero with the sign of x
+ and return x. */
+ PUT_BITS_SP32(ux & SIGNBIT_SP32, *iptr);
+ return x;
+ }
+ else if (xexp < EXPSHIFTBITS_SP32)
+ {
+ float r;
+ unsigned int ur;
+ /* x lies between 1.0 and 2**(24) */
+ /* Mask out the bits of x that we don't want */
+ mask = (1 << (EXPSHIFTBITS_SP32 - xexp)) - 1;
+ PUT_BITS_SP32(ux & ~mask, *iptr);
+ r = x - *iptr;
+ GET_BITS_SP32(r, ur);
+ PUT_BITS_SP32(((ux & SIGNBIT_SP32)|ur), r);
+ return r;
+ }
+ else if ((ux & (~SIGNBIT_SP32)) > 0x7f800000)
+ {
+ /* x is NaN */
+ *iptr = x;
+ return x + x; /* Raise invalid if it is a signalling NaN */
+ }
+ else
+ {
+ /* x is infinity or large. Set iptr to x and return zero
+ with the sign of x. */
+ *iptr = x;
+ PUT_BITS_SP32(ux & SIGNBIT_SP32, x);
+ return x;
+ }
+}
+
+weak_alias (__modff, modff)
diff --git a/src/nan.c b/src/nan.c
new file mode 100644
index 0000000..fbfc52c
--- /dev/null
+++ b/src/nan.c
@@ -0,0 +1,114 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+#include <stdio.h>
+
+double FN_PROTOTYPE(nan)(const char *tagp)
+{
+
+
+ /* Check for input range */
+ UT64 checkbits;
+ U64 val=0;
+ S64 num;
+ checkbits.u64 =QNANBITPATT_DP64;
+ if(tagp == NULL)
+ {
+ return checkbits.f64;
+ }
+
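+    /* Parse tagp as a decimal, octal (leading '0') or hexadecimal (leading
+       "0x"/"0X") number; the parsed value becomes the NaN payload below, and
+       any invalid digit falls back to the default quiet NaN. */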
+ switch(*tagp)
+ {
+        case '0': /* leading '0': octal, or hexadecimal if followed by 'x'/'X' */
+ tagp++;
+ if( *tagp == 'x' || *tagp == 'X')
+ {
+ /* base 16 */
+ tagp++;
+ while(*tagp != '\0')
+ {
+
+ if(*tagp >= 'A' && *tagp <= 'F' )
+ {
+ num = *tagp - 'A' + 10;
+ }
+ else
+ if(*tagp >= 'a' && *tagp <= 'f' )
+ {
+ num = *tagp - 'a' + 10;
+ }
+ else
+ {
+ num = *tagp - '0';
+ }
+
+ if( (num < 0 || num > 15))
+ {
+ val = QNANBITPATT_DP64;
+ break;
+ }
+ val = (val << 4) | num;
+ tagp++;
+ }
+ }
+ else
+ {
+ /* base 8 */
+ while(*tagp != '\0')
+ {
+ num = *tagp - '0';
+ if( num < 0 || num > 7)
+ {
+ val = QNANBITPATT_DP64;
+ break;
+ }
+ val = (val << 3) | num;
+ tagp++;
+ }
+ }
+ break;
+ default:
+ while(*tagp != '\0')
+ {
+ val = val*10;
+ num = *tagp - '0';
+ if( num < 0 || num > 9)
+ {
+ val = QNANBITPATT_DP64;
+ break;
+ }
+ val = val + num;
+ tagp++;
+ }
+
+ }
+
+ if((val & ~NINFBITPATT_DP64) == 0)
+ val = QNANBITPATT_DP64;
+
+ checkbits.u64 = (val | QNANBITPATT_DP64) & ~SIGNBIT_DP64;
+ return checkbits.f64 ;
+}
+
diff --git a/src/nanf.c b/src/nanf.c
new file mode 100644
index 0000000..8d712f2
--- /dev/null
+++ b/src/nanf.c
@@ -0,0 +1,120 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+#include <stdio.h>
+
+
+float FN_PROTOTYPE(nanf)(const char *tagp)
+{
+
+
+ /* Check for input range */
+ UT32 checkbits;
+ U32 val=0;
+ S32 num;
+ checkbits.u32 =QNANBITPATT_SP32;
+ if(tagp == NULL)
+ return checkbits.f32 ;
+
+
+ switch(*tagp)
+ {
+        case '0': /* leading '0': octal, or hexadecimal if followed by 'x'/'X' */
+ tagp++;
+ if( *tagp == 'x' || *tagp == 'X')
+ {
+ /* base 16 */
+ tagp++;
+ while(*tagp != '\0')
+ {
+
+ if(*tagp >= 'A' && *tagp <= 'F' )
+ {
+ num = *tagp - 'A' + 10;
+ }
+ else
+ if(*tagp >= 'a' && *tagp <= 'f' )
+ {
+ num = *tagp - 'a' + 10;
+ }
+ else
+ {
+ num = *tagp - '0';
+ }
+
+ if( (num < 0 || num > 15))
+ {
+ val = QNANBITPATT_SP32;
+ break;
+ }
+ val = (val << 4) | num;
+ tagp++;
+ }
+ }
+ else
+ {
+ /* base 8 */
+ while(*tagp != '\0')
+ {
+ num = *tagp - '0';
+ if( num < 0 || num > 7)
+ {
+ val = QNANBITPATT_SP32;
+ break;
+ }
+ val = (val << 3) | num;
+ tagp++;
+ }
+ }
+ break;
+ default:
+ while(*tagp != '\0')
+ {
+ val = val*10;
+ num = *tagp - '0';
+ if( num < 0 || num > 9)
+ {
+ val = QNANBITPATT_SP32;
+ break;
+ }
+ val = val + num;
+ tagp++;
+ }
+
+ }
+
+
+ if((val & ~INDEFBITPATT_SP32) == 0)
+ val = QNANBITPATT_SP32;
+
+ checkbits.u32 = (val | QNANBITPATT_SP32) & ~SIGNBIT_SP32;
+
+
+ return checkbits.f32 ;
+}
diff --git a/src/nearbyintf.c b/src/nearbyintf.c
new file mode 100644
index 0000000..2b656ef
--- /dev/null
+++ b/src/nearbyintf.c
@@ -0,0 +1,51 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+float FN_PROTOTYPE(nearbyintf)(float x)
+{
+ /* Check for input range */
+ UT32 checkbits,sign,val_2p23;
+ checkbits.f32=x;
+
+    /* Clear the sign bit and check whether the value can be rounded (i.e., check that the exponent is less than 23) */
+ if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+ {
+ /* take care of nan or inf */
+ if((checkbits.u32 & 0x7f800000)== 0x7f800000)
+ return x+x;
+ else
+ return x;
+ }
+
+ sign.u32 = checkbits.u32 & 0x80000000;
+ val_2p23.u32 = sign.u32 | 0x4B000000;
+ val_2p23.f32 = (x + val_2p23.f32) - val_2p23.f32;
+    /* This extra step takes care of denormals and the various rounding modes. */
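+    /* ((u << 1) >> 1) clears whatever sign bit the add/subtract produced, and
+       OR-ing the saved sign back in restores the sign of x, so results that
+       collapse to zero keep x's sign. */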
+ val_2p23.u32 = ((val_2p23.u32 << 1) >> 1) | sign.u32;
+ return (val_2p23.f32);
+}
+
diff --git a/src/nextafter.c b/src/nextafter.c
new file mode 100644
index 0000000..62d9b5a
--- /dev/null
+++ b/src/nextafter.c
@@ -0,0 +1,91 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#endif
+
+#include <float.h>
+#include <math.h>
+#include <errno.h>
+#include <limits.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+
+double FN_PROTOTYPE(nextafter)(double x, double y)
+{
+
+
+ UT64 checkbits;
+ double dy = y;
+ checkbits.f64=x;
+
+ /* if x == y return y in the type of x */
+ if( x == dy )
+ {
+ return dy;
+ }
+
+ /* check if the number is nan */
+ if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+ {
+ __amd_handle_error(DOMAIN, ERANGE, "nextafter", x, y , x+x);
+
+ return x+x;
+ }
+
+ if( x == 0.0)
+ {
+ checkbits.u64 = 1;
+ if( dy > 0.0 )
+ return checkbits.f64;
+ else
+ return -checkbits.f64;
+ }
+
+
+    /* compute the next higher or lower value */
+
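+    /* For positive finite doubles the bit pattern, read as an unsigned
+       integer, grows with the value; for negative ones it grows with the
+       magnitude.  Stepping the pattern by one therefore moves to the adjacent
+       representable value, and the XOR test below picks the direction:
+       increment to move away from zero, decrement to move toward it. */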
+ if(((x>0.0) ^ (dy>x)) == 0)
+ {
+ checkbits.u64++;
+ }
+ else
+ {
+ checkbits.u64--;
+ }
+
+ /* check if the result is nan or inf */
+ if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+ {
+ __amd_handle_error(DOMAIN, ERANGE, "nextafter", x, y , checkbits.f64);
+
+ }
+
+ return checkbits.f64;
+}
diff --git a/src/nextafterf.c b/src/nextafterf.c
new file mode 100644
index 0000000..019187f
--- /dev/null
+++ b/src/nextafterf.c
@@ -0,0 +1,102 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#endif
+
+
+#include <float.h>
+#include <math.h>
+#include <errno.h>
+#include <limits.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+
+float FN_PROTOTYPE(nextafterf)(float x, float y)
+{
+
+
+ UT32 checkbits;
+ float dy = y;
+ checkbits.f32=x;
+
+ /* if x == y return y in the type of x */
+ if( x == dy )
+ {
+ return dy;
+ }
+
+ /* check if the number is nan */
+ if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+ {
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, ERANGE, "nextafterf", x, is_x_snan, y , 0,x+x, 0);
+
+ }
+
+ return x+x;
+ }
+
+ if( x == 0.0)
+ {
+ checkbits.u32 = 1;
+ if( dy > 0.0 )
+ return checkbits.f32;
+ else
+ return -checkbits.f32;
+ }
+
+
+    /* compute the next higher or lower value */
+ if(((x>0.0F) ^ (dy>x)) == 0)
+ {
+ checkbits.u32++;
+ }
+ else
+ {
+ checkbits.u32--;
+ }
+
+ /* check if the result is nan or inf */
+ if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+ {
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, ERANGE, "nextafterf", x, is_x_snan, y , 0,checkbits.f32, 0);
+
+ }
+ }
+
+ return checkbits.f32;
+}
diff --git a/src/nexttoward.c b/src/nexttoward.c
new file mode 100644
index 0000000..14b2f62
--- /dev/null
+++ b/src/nexttoward.c
@@ -0,0 +1,93 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+
+
+double FN_PROTOTYPE(nexttoward)(double x, long double y)
+{
+
+
+ UT64 checkbits;
+ long double dy = (long double) y;
+ checkbits.f64=x;
+
+ /* if x == y return y in the type of x */
+ if( x == dy )
+ {
+ return (double) dy;
+ }
+
+ /* check if the number is nan */
+ if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+ {
+
+ __amd_handle_error(DOMAIN, ERANGE, "nexttoward", x, (double)y ,x+x);
+
+
+ return x+x;
+ }
+
+ if( x == 0.0)
+ {
+ checkbits.u64 = 1;
+ if( dy > 0.0 )
+ return checkbits.f64;
+ else
+ return -checkbits.f64;
+ }
+
+
+    /* compute the next higher or lower value */
+
+ if(((x>0.0) ^ (dy>x)) == 0)
+ {
+ checkbits.u64++;
+ }
+ else
+ {
+ checkbits.u64--;
+ }
+
+ /* check if the result is nan or inf */
+ if(((checkbits.u64 & ~SIGNBIT_DP64) >= EXPBITS_DP64 ))
+ {
+ __amd_handle_error(DOMAIN, ERANGE, "nexttoward", x, (double)y ,checkbits.f64);
+
+
+ }
+
+ return checkbits.f64;
+}
diff --git a/src/nexttowardf.c b/src/nexttowardf.c
new file mode 100644
index 0000000..47b42c7
--- /dev/null
+++ b/src/nexttowardf.c
@@ -0,0 +1,97 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+#include "libm_special.h"
+
+
+float FN_PROTOTYPE(nexttowardf)(float x, long double y)
+{
+
+
+ UT32 checkbits;
+ long double dy = (long double) y;
+ checkbits.f32=x;
+
+ /* if x == y return y in the type of x */
+ if( x == dy )
+ {
+ return (float) dy;
+ }
+
+ /* check if the number is nan */
+ if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+ {
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, ERANGE, "nexttowardf", x, is_x_snan, (float) y , 0,x+x, 0);
+
+ }
+
+ return x+x;
+ }
+
+ if( x == 0.0)
+ {
+ checkbits.u32 = 1;
+ if( dy > 0.0 )
+ return checkbits.f32;
+ else
+ return -checkbits.f32;
+ }
+
+
+	/* compute the next higher or lower value */
+ if(((x>0.0F) ^ (dy>x)) == 0)
+ {
+ checkbits.u32++;
+ }
+ else
+ {
+ checkbits.u32--;
+ }
+
+ /* check if the result is nan or inf */
+ if(((checkbits.u32 & ~SIGNBIT_SP32) >= EXPBITS_SP32 ))
+ {
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, ERANGE, "nexttowardf", x, is_x_snan, (float) y , 0,checkbits.f32, 0);
+ }
+ }
+
+ return checkbits.f32;
+}
diff --git a/src/pow_special.c b/src/pow_special.c
new file mode 100644
index 0000000..cb571d2
--- /dev/null
+++ b/src/pow_special.c
@@ -0,0 +1,168 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+#ifdef __x86_64__
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+// these codes and the ones in the related .S or .asm files have to match
+#define POW_X_ONE_Y_SNAN 1
+#define POW_X_ZERO_Z_INF 2
+#define POW_X_NAN 3
+#define POW_Y_NAN 4
+#define POW_X_NAN_Y_NAN 5
+#define POW_X_NEG_Y_NOTINT 6
+#define POW_Z_ZERO 7
+#define POW_Z_DENORMAL 8
+#define POW_Z_INF 9
+
+float _powf_special(float x, float y, float z, U32 code)
+{
+ switch(code)
+ {
+ case POW_X_ONE_Y_SNAN:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ }
+ break;
+
+ case POW_X_ZERO_Z_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO);
+ __amd_handle_errorf(SING, ERANGE, "powf", x, 0, y, 0, z, 0);
+ }
+ break;
+
+ case POW_X_NAN:
+ case POW_Y_NAN:
+ case POW_X_NAN_Y_NAN:
+ {
+#ifdef WIN64
+ unsigned int is_x_snan = 0, is_y_snan = 0, is_z_snan = 0;
+ UT32 xm, ym, zm;
+ xm.f32 = x;
+ ym.f32 = y;
+ zm.f32 = z;
+ if(code == POW_X_NAN) { is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); }
+ if(code == POW_Y_NAN) { is_y_snan = ( ((ym.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); }
+ if(code == POW_X_NAN_Y_NAN) { is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ is_y_snan = ( ((ym.u32 & QNAN_MASK_32) == 0) ? 1 : 0 ); }
+ is_z_snan = ( ((zm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_errorf(DOMAIN, EDOM, "powf", x, is_x_snan, y, is_y_snan, z, is_z_snan);
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+
+ case POW_X_NEG_Y_NOTINT:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_errorf(DOMAIN, EDOM, "powf", x, 0, y, 0, z, 0);
+ }
+ break;
+
+ case POW_Z_ZERO:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW));
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "powf", x, 0, y, 0, z, 0);
+ }
+ break;
+
+ case POW_Z_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW));
+ __amd_handle_errorf(OVERFLOW, ERANGE, "powf", x, 0, y, 0, z, 0);
+ }
+ break;
+ }
+
+ return z;
+}
+
+double _pow_special(double x, double y, double z, U32 code)
+{
+ switch(code)
+ {
+ case POW_X_ONE_Y_SNAN:
+ {
+#ifdef WIN64
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+ break;
+
+ case POW_X_ZERO_Z_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_DIVBYZERO);
+ __amd_handle_error(SING, ERANGE, "pow", x, y, z);
+ }
+ break;
+
+ case POW_X_NAN:
+ case POW_Y_NAN:
+ case POW_X_NAN_Y_NAN:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, "pow", x, y, z);
+#endif
+ }
+ break;
+
+ case POW_X_NEG_Y_NOTINT:
+ {
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ __amd_handle_error(DOMAIN, EDOM, "pow", x, y, z);
+ }
+ break;
+
+ case POW_Z_ZERO:
+ case POW_Z_DENORMAL:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_UNDERFLOW));
+ __amd_handle_error(UNDERFLOW, ERANGE, "pow", x, y, z);
+ }
+ break;
+
+ case POW_Z_INF:
+ {
+ _mm_setcsr(_mm_getcsr() | (MXCSR_ES_INEXACT|MXCSR_ES_OVERFLOW));
+ __amd_handle_error(OVERFLOW, ERANGE, "pow", x, y, z);
+ }
+ break;
+ }
+
+ return z;
+}
+
+#endif /* __x86_64__ */
diff --git a/src/remainder_piby2.c b/src/remainder_piby2.c
new file mode 100644
index 0000000..3f6676f
--- /dev/null
+++ b/src/remainder_piby2.c
@@ -0,0 +1,331 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#define EXPBITS_DP64 0x7ff0000000000000
+#define EXPSHIFTBITS_DP64 52
+#define EXPBIAS_DP64 1023
+#define MANTBITS_DP64 0x000fffffffffffff
+#define IMPBIT_DP64 0x0010000000000000
+#define SIGNBIT_DP64 0x8000000000000000
+
+
+#define GET_BITS_DP64(x, ux) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.d = (x); \
+ ux = _bitsy.i; \
+ }
+
+#define PUT_BITS_DP64(ux, x) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.i = (ux); \
+ x = _bitsy.d; \
+ }
+
+/* Define this to get debugging print statements activated */
+#define DEBUGGING_PRINT
+#undef DEBUGGING_PRINT
+
+
+#ifdef DEBUGGING_PRINT
+#include <stdio.h>
+char *d2b(int d, int bitsper, int point)
+{
+ static char buff[50];
+ int i, j;
+ j = bitsper;
+ if (point >= 0 && point <= bitsper)
+ j++;
+ buff[j] = '\0';
+ for (i = bitsper - 1; i >= 0; i--)
+ {
+ j--;
+ if (d % 2 == 1)
+ buff[j] = '1';
+ else
+ buff[j] = '0';
+ if (i == point)
+ {
+ j--;
+ buff[j] = '.';
+ }
+ d /= 2;
+ }
+ return buff;
+}
+#endif
+
+/* Given positive argument x, reduce it to the range [-pi/4,pi/4] using
+ extra precision, and return the result in r, rr.
+ Return value "region" tells how many lots of pi/2 were subtracted
+ from x to put it in the range [-pi/4,pi/4], mod 4. */
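+/* Illustrative example: x = 10.0 is 6*(pi/2) + 0.5752220392..., so the
+   routine delivers region = 6 mod 4 = 2 and r ~= 0.5752220392, with rr
+   holding the low-order tail of that remainder. */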
+void __amd_remainder_piby2(double x, double *r, double *rr, int *region)
+{
+
+ /* This method simulates multi-precision floating-point
+ arithmetic and is accurate for all 1 <= x < infinity */
+ static const double
+ piby2_lead = 1.57079632679489655800e+00, /* 0x3ff921fb54442d18 */
+ piby2_part1 = 1.57079631090164184570e+00, /* 0x3ff921fb50000000 */
+ piby2_part2 = 1.58932547122958567343e-08, /* 0x3e5110b460000000 */
+ piby2_part3 = 6.12323399573676480327e-17; /* 0x3c91a62633145c06 */
+ const int bitsper = 10;
+ unsigned long long res[500];
+ unsigned long long ux, u, carry, mask, mant, highbitsrr;
+ int first, last, i, rexp, xexp, resexp, ltb, determ;
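+  /* 2/pi split into consecutive 10-bit chunks (bitsper = 10), most
+     significant chunk first and padded with leading zero chunks; the window
+     pibits[first..last] below supplies the roughly 180 bits of 2/pi needed
+     for a given x. */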
+ double xx, t;
+ static unsigned long long pibits[] =
+ {
+ 0, 0, 0, 0, 0, 0,
+ 162, 998, 54, 915, 580, 84, 671, 777, 855, 839,
+ 851, 311, 448, 877, 553, 358, 316, 270, 260, 127,
+ 593, 398, 701, 942, 965, 390, 882, 283, 570, 265,
+ 221, 184, 6, 292, 750, 642, 465, 584, 463, 903,
+ 491, 114, 786, 617, 830, 930, 35, 381, 302, 749,
+ 72, 314, 412, 448, 619, 279, 894, 260, 921, 117,
+ 569, 525, 307, 637, 156, 529, 504, 751, 505, 160,
+ 945, 1022, 151, 1023, 480, 358, 15, 956, 753, 98,
+ 858, 41, 721, 987, 310, 507, 242, 498, 777, 733,
+ 244, 399, 870, 633, 510, 651, 373, 158, 940, 506,
+ 997, 965, 947, 833, 825, 990, 165, 164, 746, 431,
+ 949, 1004, 287, 565, 464, 533, 515, 193, 111, 798
+ };
+
+ GET_BITS_DP64(x, ux);
+
+#ifdef DEBUGGING_PRINT
+ printf("On entry, x = %25.20e = %s\n", x, double2hex(&x));
+#endif
+
+ xexp = (int)(((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64);
+ ux = (ux & MANTBITS_DP64) | IMPBIT_DP64;
+
+ /* Now ux is the mantissa bit pattern of x as a long integer */
+ carry = 0;
+ mask = 1;
+ mask = (mask << bitsper) - 1;
+
+ /* Set first and last to the positions of the first
+ and last chunks of 2/pi that we need */
+ first = xexp / bitsper;
+ resexp = xexp - first * bitsper;
+ /* 180 is the theoretical maximum number of bits (actually
+ 175 for IEEE double precision) that we need to extract
+ from the middle of 2/pi to compute the reduced argument
+ accurately enough for our purposes */
+ last = first + 180 / bitsper;
+
+ /* Do a long multiplication of the bits of 2/pi by the
+ integer mantissa */
+ /* Unroll the loop. This is only correct because we know
+ that bitsper is fixed as 10. */
+ res[19] = 0;
+ u = pibits[last] * ux;
+ res[18] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-1] * ux + carry;
+ res[17] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-2] * ux + carry;
+ res[16] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-3] * ux + carry;
+ res[15] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-4] * ux + carry;
+ res[14] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-5] * ux + carry;
+ res[13] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-6] * ux + carry;
+ res[12] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-7] * ux + carry;
+ res[11] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-8] * ux + carry;
+ res[10] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-9] * ux + carry;
+ res[9] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-10] * ux + carry;
+ res[8] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-11] * ux + carry;
+ res[7] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-12] * ux + carry;
+ res[6] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-13] * ux + carry;
+ res[5] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-14] * ux + carry;
+ res[4] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-15] * ux + carry;
+ res[3] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-16] * ux + carry;
+ res[2] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-17] * ux + carry;
+ res[1] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-18] * ux + carry;
+ res[0] = u & mask;
+
+#ifdef DEBUGGING_PRINT
+ printf("resexp = %d\n", resexp);
+ printf("Significant part of x * 2/pi with binary"
+ " point in correct place:\n");
+ for (i = 0; i <= last - first; i++)
+ {
+ if (i > 0 && i % 5 == 0)
+ printf("\n ");
+ if (i == 1)
+ printf("%s ", d2b((int)res[i], bitsper, resexp));
+ else
+ printf("%s ", d2b((int)res[i], bitsper, -1));
+ }
+ printf("\n");
+#endif
+
+ /* Reconstruct the result */
+ ltb = (int)((((res[0] << bitsper) | res[1])
+ >> (bitsper - 1 - resexp)) & 7);
+
+ /* determ says whether the fractional part is >= 0.5 */
+ determ = ltb & 1;
+
+#ifdef DEBUGGING_PRINT
+ printf("ltb = %d (last two bits before binary point"
+ " and first bit after)\n", ltb);
+ printf("determ = %d (1 means need to negate because the fractional\n"
+ " part of x * 2/pi is greater than 0.5)\n", determ);
+#endif
+
+ i = 1;
+ if (determ)
+ {
+ /* The mantissa is >= 0.5. We want to subtract it
+ from 1.0 by negating all the bits */
+ *region = ((ltb >> 1) + 1) & 3;
+ mant = 1;
+ mant = ~(res[1]) & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0020000000000000)
+ {
+ i++;
+ mant = (mant << bitsper) | (~(res[i]) & mask);
+ }
+ highbitsrr = ~(res[i + 1]) << (64 - bitsper);
+ }
+ else
+ {
+ *region = (ltb >> 1);
+ mant = 1;
+ mant = res[1] & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0020000000000000)
+ {
+ i++;
+ mant = (mant << bitsper) | res[i];
+ }
+ highbitsrr = res[i + 1] << (64 - bitsper);
+ }
+
+ rexp = 52 + resexp - i * bitsper;
+
+ while (mant >= 0x0020000000000000)
+ {
+ rexp++;
+ highbitsrr = (highbitsrr >> 1) | ((mant & 1) << 63);
+ mant >>= 1;
+ }
+
+#ifdef DEBUGGING_PRINT
+ printf("Normalised mantissa = 0x%016lx\n", mant);
+ printf("High bits of rest of mantissa = 0x%016lx\n", highbitsrr);
+ printf("Exponent to be inserted on mantissa = rexp = %d\n", rexp);
+#endif
+
+ /* Put the result exponent rexp onto the mantissa pattern */
+ u = ((unsigned long long)rexp + EXPBIAS_DP64) << EXPSHIFTBITS_DP64;
+ ux = (mant & MANTBITS_DP64) | u;
+ if (determ)
+ /* If we negated the mantissa we negate x too */
+ ux |= SIGNBIT_DP64;
+ PUT_BITS_DP64(ux, x);
+
+ /* Create the bit pattern for rr */
+ highbitsrr >>= 12; /* Note this is shifted one place too far */
+ u = ((unsigned long long)rexp + EXPBIAS_DP64 - 53) << EXPSHIFTBITS_DP64;
+ PUT_BITS_DP64(u, t);
+ u |= highbitsrr;
+ PUT_BITS_DP64(u, xx);
+
+ /* Subtract the implicit bit we accidentally added */
+ xx -= t;
+ /* Set the correct sign, and double to account for the
+ "one place too far" shift */
+ if (determ)
+ xx *= -2.0;
+ else
+ xx *= 2.0;
+
+#ifdef DEBUGGING_PRINT
+ printf("(lead part of x*2/pi) = %25.20e = %s\n", x, double2hex(&x));
+ printf("(tail part of x*2/pi) = %25.20e = %s\n", xx, double2hex(&xx));
+#endif
+
+ /* (x,xx) is an extra-precise version of the fractional part of
+ x * 2 / pi. Multiply (x,xx) by pi/2 in extra precision
+ to get the reduced argument (r,rr). */
+ {
+ double hx, tx, c, cc;
+ /* Split x into hx (head) and tx (tail) */
+ GET_BITS_DP64(x, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux, hx);
+ tx = x - hx;
+
+ c = piby2_lead * x;
+ cc = ((((piby2_part1 * hx - c) + piby2_part1 * tx) +
+ piby2_part2 * hx) + piby2_part2 * tx) +
+ (piby2_lead * xx + piby2_part3 * x);
+ *r = c + cc;
+ *rr = (c - *r) + cc;
+ }
+
+#ifdef DEBUGGING_PRINT
+ printf(" (r,rr) = lead and tail parts of frac(x*2/pi) * pi/2:\n");
+ printf(" r = %25.20e = %s\n", *r, double2hex(r));
+ printf("rr = %25.20e = %s\n", *rr, double2hex(rr));
+ printf("region = (number of pi/2 subtracted from x) mod 4 = %d\n",
+ *region);
+#endif
+ return;
+}
diff --git a/src/remainder_piby2d2f.c b/src/remainder_piby2d2f.c
new file mode 100644
index 0000000..59ed44a
--- /dev/null
+++ b/src/remainder_piby2d2f.c
@@ -0,0 +1,217 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#define EXPBITS_DP64 0x7ff0000000000000
+#define EXPSHIFTBITS_DP64 52
+#define EXPBIAS_DP64 1023
+#define MANTBITS_DP64 0x000fffffffffffff
+#define IMPBIT_DP64 0x0010000000000000
+#define SIGNBIT_DP64 0x8000000000000000
+
+#define PUT_BITS_DP64(ux, x) \
+ { \
+ volatile union {double d; unsigned long long i;} _bitsy; \
+ _bitsy.i = (ux); \
+ x = _bitsy.d; \
+ }
+
+/* Derived from static inline void __amd_remainder_piby2f_inline(unsigned long long ux, double *r, int *region)
+   in libm_inlines_amd.h. libm_inlines.h has the pure Windows version, while libm_inlines_amd.h has the mixed one.
+*/
+/* Given positive argument x, reduce it to the range [-pi/4,pi/4] using
+ extra precision, and return the result in r.
+ Return value "region" tells how many lots of pi/2 were subtracted
+ from x to put it in the range [-pi/4,pi/4], mod 4. */
+void __amd_remainder_piby2d2f(unsigned long long ux, double *r, int *region)
+{
+ /* This method simulates multi-precision floating-point
+ arithmetic and is accurate for all 1 <= x < infinity */
+ unsigned long long u, carry, mask, mant, highbitsrr;
+ double dx;
+ unsigned long long res[500];
+ int first, last, i, rexp, xexp, resexp, ltb, determ;
+ static const double
+ piby2 = 1.57079632679489655800e+00; /* 0x3ff921fb54442d18 */
+ const int bitsper = 10;
+ static unsigned long long pibits[] =
+ {
+ 0, 0, 0, 0, 0, 0,
+ 162, 998, 54, 915, 580, 84, 671, 777, 855, 839,
+ 851, 311, 448, 877, 553, 358, 316, 270, 260, 127,
+ 593, 398, 701, 942, 965, 390, 882, 283, 570, 265,
+ 221, 184, 6, 292, 750, 642, 465, 584, 463, 903,
+ 491, 114, 786, 617, 830, 930, 35, 381, 302, 749,
+ 72, 314, 412, 448, 619, 279, 894, 260, 921, 117,
+ 569, 525, 307, 637, 156, 529, 504, 751, 505, 160,
+ 945, 1022, 151, 1023, 480, 358, 15, 956, 753, 98,
+ 858, 41, 721, 987, 310, 507, 242, 498, 777, 733,
+ 244, 399, 870, 633, 510, 651, 373, 158, 940, 506,
+ 997, 965, 947, 833, 825, 990, 165, 164, 746, 431,
+ 949, 1004, 287, 565, 464, 533, 515, 193, 111, 798
+ };
+
+ xexp = (int)(((ux & EXPBITS_DP64) >> EXPSHIFTBITS_DP64) - EXPBIAS_DP64);
+ ux = (ux & MANTBITS_DP64) | IMPBIT_DP64;
+
+ /* Now ux is the mantissa bit pattern of x as a long integer */
+ mask = 1;
+ mask = (mask << bitsper) - 1;
+
+ /* Set first and last to the positions of the first
+ and last chunks of 2/pi that we need */
+ first = xexp / bitsper;
+ resexp = xexp - first * bitsper;
+ /* 180 is the theoretical maximum number of bits (actually
+ 175 for IEEE double precision) that we need to extract
+ from the middle of 2/pi to compute the reduced argument
+ accurately enough for our purposes */
+ last = first + 180 / bitsper;
+
+ /* Do a long multiplication of the bits of 2/pi by the
+ integer mantissa */
+ /* Unroll the loop. This is only correct because we know
+ that bitsper is fixed as 10. */
+ res[19] = 0;
+ u = pibits[last] * ux;
+ res[18] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-1] * ux + carry;
+ res[17] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-2] * ux + carry;
+ res[16] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-3] * ux + carry;
+ res[15] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-4] * ux + carry;
+ res[14] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-5] * ux + carry;
+ res[13] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-6] * ux + carry;
+ res[12] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-7] * ux + carry;
+ res[11] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-8] * ux + carry;
+ res[10] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-9] * ux + carry;
+ res[9] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-10] * ux + carry;
+ res[8] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-11] * ux + carry;
+ res[7] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-12] * ux + carry;
+ res[6] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-13] * ux + carry;
+ res[5] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-14] * ux + carry;
+ res[4] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-15] * ux + carry;
+ res[3] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-16] * ux + carry;
+ res[2] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-17] * ux + carry;
+ res[1] = u & mask;
+ carry = u >> bitsper;
+ u = pibits[last-18] * ux + carry;
+ res[0] = u & mask;
+
+ /* Reconstruct the result */
+ ltb = (int)((((res[0] << bitsper) | res[1])
+ >> (bitsper - 1 - resexp)) & 7);
+
+ /* determ says whether the fractional part is >= 0.5 */
+ determ = ltb & 1;
+
+ i = 1;
+ if (determ)
+ {
+ /* The mantissa is >= 0.5. We want to subtract it
+ from 1.0 by negating all the bits */
+ *region = ((ltb >> 1) + 1) & 3;
+ mant = 1;
+ mant = ~(res[1]) & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0020000000000000)
+ {
+ i++;
+ mant = (mant << bitsper) | (~(res[i]) & mask);
+ }
+ highbitsrr = ~(res[i + 1]) << (64 - bitsper);
+ }
+ else
+ {
+ *region = (ltb >> 1);
+ mant = 1;
+ mant = res[1] & ((mant << (bitsper - resexp)) - 1);
+ while (mant < 0x0020000000000000)
+ {
+ i++;
+ mant = (mant << bitsper) | res[i];
+ }
+ highbitsrr = res[i + 1] << (64 - bitsper);
+ }
+
+ rexp = 52 + resexp - i * bitsper;
+
+ while (mant >= 0x0020000000000000)
+ {
+ rexp++;
+ highbitsrr = (highbitsrr >> 1) | ((mant & 1) << 63);
+ mant >>= 1;
+ }
+
+ /* Put the result exponent rexp onto the mantissa pattern */
+ u = ((unsigned long long)rexp + EXPBIAS_DP64) << EXPSHIFTBITS_DP64;
+ ux = (mant & MANTBITS_DP64) | u;
+ if (determ)
+ /* If we negated the mantissa we negate x too */
+ ux |= SIGNBIT_DP64;
+ PUT_BITS_DP64(ux, dx);
+
+    /* dx now holds a double precision version of the fractional part of
+       x * 2 / pi. Multiply dx by pi/2 in double precision
+       to get the reduced argument r. */
+ *r = dx * piby2;
+
+ return;
+}
+
+void __remainder_piby2d2f(unsigned long ux, double *r, int *region)
+{
+ __amd_remainder_piby2d2f((unsigned long long) ux, r, region);
+}
+
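
The unrolled block above is a fixed-length long multiplication in base 2^bitsper: each 10-bit chunk of 2/pi is multiplied by the 53-bit integer mantissa and the carry is propagated into the next, more significant, chunk. A small self-contained sketch of the same technique follows; the digit values and the multiplier are chosen for illustration only.

#include <stdio.h>

int main(void)
{
    const int bitsper = 10;
    const unsigned long long mask = (1ULL << bitsper) - 1;
    /* Four 10-bit digits of a multi-precision number, most significant first
       (values chosen for illustration only). */
    unsigned long long digits[4] = { 162, 998, 54, 915 };
    unsigned long long ux = 0x001921fb54442d18ULL;  /* a 53-bit multiplier */
    unsigned long long res[10] = { 0 }, carry = 0, u;
    int pos = 9;

    /* Multiply chunk by chunk, least significant digit first, carrying into
       the next chunk exactly as the unrolled loop above does. */
    for (int k = 3; k >= 0; --k)
    {
        u = digits[k] * ux + carry;
        res[pos--] = u & mask;
        carry = u >> bitsper;
    }
    while (carry && pos >= 0)   /* flush the remaining carry into higher chunks */
    {
        res[pos--] = carry & mask;
        carry >>= bitsper;
    }

    for (int k = 0; k < 10; ++k)
        printf("res[%d] = %llu\n", k, res[k]);
    return 0;
}
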
diff --git a/src/rint.c b/src/rint.c
new file mode 100644
index 0000000..770685f
--- /dev/null
+++ b/src/rint.c
@@ -0,0 +1,69 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+
+
+
+double FN_PROTOTYPE(rint)(double x)
+{
+
+ UT64 checkbits,val_2p52;
+ UT32 sign;
+ checkbits.f64=x;
+
+    /* Clear the sign bit and check whether the value can be rounded
+       (i.e. whether the exponent is less than 52) */
+ if( (checkbits.u64 & 0x7FFFFFFFFFFFFFFF) > 0x4330000000000000)
+ {
+ /* take care of nan or inf */
+ if((checkbits.u32[1] & 0x7ff00000)== 0x7ff00000)
+ return x+x;
+ else
+ return x;
+ }
+
+ sign.u32 = checkbits.u32[1] & 0x80000000;
+ val_2p52.u32[1] = sign.u32 | 0x43300000;
+ val_2p52.u32[0] = 0;
+
+ /* Add and sub 2^52 to round the number according to the current rounding direction */
+ val_2p52.f64 = (x + val_2p52.f64) - val_2p52.f64;
+
+ /*This extra line is to take care of denormals and various rounding modes*/
+ val_2p52.u32[1] = ((val_2p52.u32[1] << 1) >> 1) | sign.u32;
+
+ if(x!=val_2p52.f64)
+ {
+ /* Raise floating-point inexact exception if the result differs in value from the argument */
+ checkbits.u64 = QNANBITPATT_DP64;
+ checkbits.f64 = checkbits.f64 + checkbits.f64; /* raise inexact exception by adding two nan numbers.*/
+ }
+
+
+ return (val_2p52.f64);
+}
+
+
+
+
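
A standalone sketch of the add-and-subtract-2^52 trick that rint() relies on above, assuming round-to-nearest-even and 0 <= x < 2^52; the sign, NaN/Inf and inexact-flag handling of the real routine is not reproduced.

#include <stdio.h>

static double round_via_2p52(double x)
{
    /* 2^52: adding it pushes the fraction bits out of the significand, so the
       hardware rounds x in the current rounding mode; subtracting restores
       the magnitude. volatile keeps the compiler from folding the pair away. */
    volatile double big = 4503599627370496.0;
    volatile double t = x + big;
    return t - big;
}

int main(void)
{
    printf("%.1f\n", round_via_2p52(2.5));   /* 2.0 under round-to-nearest-even */
    printf("%.1f\n", round_via_2p52(3.5));   /* 4.0 */
    printf("%.1f\n", round_via_2p52(2.3));   /* 2.0 */
    return 0;
}
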
diff --git a/src/rintf.c b/src/rintf.c
new file mode 100644
index 0000000..e048c11
--- /dev/null
+++ b/src/rintf.c
@@ -0,0 +1,65 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "libm_amd.h"
+#include "libm_util_amd.h"
+
+
+
+float FN_PROTOTYPE(rintf)(float x)
+{
+
+ UT32 checkbits,sign,val_2p23;
+ checkbits.f32=x;
+
+ /* Clear the sign bit and check if the value can be rounded */
+ if( (checkbits.u32 & 0x7FFFFFFF) > 0x4B000000)
+ {
+        /* The magnitude exceeds 2^23, so the value is already an integer; it could also be NaN or Inf */
+        /* take care of NaN or Inf */
+ if((checkbits.u32 & 0x7f800000)== 0x7f800000)
+ return x+x;
+ else
+ return x;
+ }
+
+ sign.u32 = checkbits.u32 & 0x80000000;
+ val_2p23.u32 = (checkbits.u32 & 0x80000000) | 0x4B000000;
+
+ /* Add and sub 2^23 to round the number according to the current rounding direction */
+ val_2p23.f32 = ((x + val_2p23.f32) - val_2p23.f32);
+
+ /*This extra line is to take care of denormals and various rounding modes*/
+ val_2p23.u32 = ((val_2p23.u32 << 1) >> 1) | sign.u32;
+
+ if (val_2p23.f32 != x)
+ {
+ /* Raise floating-point inexact exception if the result differs in value from the argument */
+ checkbits.u32 = 0xFFC00000;
+ checkbits.f32 = checkbits.f32 + checkbits.f32; /* raise inexact exception by adding two nan numbers.*/
+ }
+
+
+ return val_2p23.f32;
+}
+
diff --git a/src/roundf.c b/src/roundf.c
new file mode 100644
index 0000000..596c381
--- /dev/null
+++ b/src/roundf.c
@@ -0,0 +1,97 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+float FN_PROTOTYPE(roundf)(float f)
+{
+ UT32 u32f, u32Temp;
+ U32 u32sign, u32exp, u32mantissa;
+ int intexp; /*Needs to be signed */
+ u32f.f32 = f;
+ u32sign = u32f.u32 & SIGNBIT_SP32;
+ if ((u32f.u32 & 0X7F800000) == 0x7F800000)
+ {
+ //u32f.f32 = f;
+        /* Return a quiet NaN:
+           quiet the signalling NaN */
+ if(!((u32f.u32 & MANTBITS_SP32) == 0))
+ u32f.u32 |= QNAN_MASK_32;
+ /*else the number is infinity*/
+ //Raise range or domain error
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = f;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "roundf", f, is_x_snan, 0.0F , 0,u32f.f32, 0);
+ }
+
+
+ return u32f.f32;
+ }
+ /*Get the exponent of the input*/
+ intexp = (u32f.u32 & 0x7f800000) >> 23;
+ intexp -= 0x7F;
+ /*If exponent is greater than 22 then the number is already
+ rounded*/
+ if (intexp > 22)
+ return f;
+ if (intexp < 0)
+ {
+ u32Temp.f32 = f;
+ u32Temp.u32 &= 0x7FFFFFFF;
+        /* Add a large number (2^23 + 1 = 8388609.0F) to force
+           the fraction bits out of the significand */
+ u32Temp.f32 = (u32Temp.f32 + 8388609.0F);
+        /* Subtract the large number back out */
+ u32Temp.f32 -= 8388609;
+ if (u32sign)
+ u32Temp.u32 |= 0x80000000;
+ return u32Temp.f32;
+ }
+ else
+ {
+ /*if(intexp == -1)
+ u32exp = 0x3F800000; */
+ u32f.u32 &= 0x7FFFFFFF;
+ u32f.f32 += 0.5;
+ u32exp = u32f.u32 & 0x7F800000;
+        /* Right shift then left shift to discard the fraction bits */
+ u32mantissa = (u32f.u32 & MANTBITS_SP32) >> (23 - intexp);
+ u32mantissa = u32mantissa << (23 - intexp);
+ u32Temp.u32 = u32sign | u32exp | u32mantissa;
+ return (u32Temp.f32);
+ }
+}
+
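
A small sketch of the 2^23 + 1 trick used in the |f| < 1 branch of roundf above; the inputs are illustrative and the sign re-attachment of the routine is not reproduced.

#include <stdio.h>

int main(void)
{
    /* 2^23 + 1: adding it forces the fraction bits of a float out of the
       significand, and because the integer part is odd a halfway value rounds
       up to the even neighbour, giving the round-half-away-from-zero result
       that roundf needs for non-negative inputs. */
    volatile float big = 8388609.0F;
    float inputs[3] = { 0.25F, 0.5F, 0.75F };
    for (int i = 0; i < 3; ++i)
    {
        volatile float t = inputs[i] + big;
        printf("round(%.2f) -> %.1f\n", inputs[i], t - big);  /* 0.0, 1.0, 1.0 */
    }
    return 0;
}
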
diff --git a/src/scalbln.c b/src/scalbln.c
new file mode 100644
index 0000000..51499d8
--- /dev/null
+++ b/src/scalbln.c
@@ -0,0 +1,119 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+double FN_PROTOTYPE(scalbln)(double x, long int n)
+{
+ UT64 val;
+ unsigned int sign;
+ int exponent;
+ val.f64 = x;
+ sign = val.u32[1] & 0x80000000;
+ val.u32[1] = val.u32[1] & 0x7fffffff; /* remove the sign bit */
+
+ if((val.u32[1] & 0x7ff00000)== 0x7ff00000)/* x= nan or x = +-inf*/
+ return x+x;
+
+ if((val.u64 == 0x0000000000000000) || (n==0))
+ return x; /* x= +-0 or n= 0*/
+
+ exponent = val.u32[1] >> 20; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f64 = val.f64 * VAL_2PMULTIPLIER_DP;/*multiply by 2^53 to bring it to the normal range*/
+ exponent = val.u32[1] >> 20; /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_DP;
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(UNDERFLOW, ERANGE, "scalbln", x, (double)n ,val.f64);
+
+ return val.f64;
+ }
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(OVERFLOW, ERANGE, "scalbln", x, (double)n ,val.f64);
+
+
+ return val.f64;
+ }
+
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(UNDERFLOW, ERANGE, "scalbln", x, (double)n ,val.f64);
+
+ return val.f64;
+ }
+
+    if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(OVERFLOW, ERANGE, "scalbln", x, (double)n ,val.f64);
+
+
+ return val.f64;
+ }
+
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ return val.f64;
+}
+
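
A minimal sketch of the exponent-field update that the normal-to-normal path above performs; it assumes both the input and the result are normal doubles, so all of the NaN/Inf, denormal, underflow and overflow branches are omitted, and the helper name scale_normal is illustrative only.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

static double scale_normal(double x, int n)
{
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);
    /* Add n directly to the biased exponent field (bits 52..62). */
    int exponent = (int)((bits >> 52) & 0x7ff) + n;
    bits = (bits & 0x800fffffffffffffULL) | ((uint64_t)exponent << 52);
    memcpy(&x, &bits, sizeof bits);
    return x;
}

int main(void)
{
    printf("%g\n", scale_normal(3.0, 4));    /* 48 */
    printf("%g\n", scale_normal(1.5, -2));   /* 0.375 */
    return 0;
}
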
diff --git a/src/scalblnf.c b/src/scalblnf.c
new file mode 100644
index 0000000..cc627bb
--- /dev/null
+++ b/src/scalblnf.c
@@ -0,0 +1,133 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+float FN_PROTOTYPE(scalblnf)(float x, long int n)
+{
+ UT32 val;
+ unsigned int sign;
+ int exponent;
+ val.f32 = x;
+ sign = val.u32 & 0x80000000;
+ val.u32 = val.u32 & 0x7fffffff;/* remove the sign bit */
+
+ if((val.u32 & 0x7f800000)== 0x7f800000)/* x= nan or x = +-inf*/
+ return x+x;
+
+ if((val.u32 == 0x00000000) || (n==0))/* x= +-0 or n= 0*/
+ return x;
+
+ exponent = val.u32 >> 23; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f32 = val.f32 * VAL_2PMULTIPLIER_SP;/*multiply by 2^24 to bring it to the normal range*/
+ exponent = (val.u32 >> 23); /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_SP;
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+    if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "scalblnf", x, is_x_snan, (float)n, 0,val.f32, 0);
+ }
+
+ return val.f32;
+ }
+
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);/*x is normal and output is normal*/
+ return val.f32;
+}
+
diff --git a/src/scalbn.c b/src/scalbn.c
new file mode 100644
index 0000000..facb718
--- /dev/null
+++ b/src/scalbn.c
@@ -0,0 +1,117 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+
+
+double FN_PROTOTYPE(scalbn)(double x, int n)
+{
+ UT64 val;
+ unsigned int sign;
+ int exponent;
+ val.f64 = x;
+ sign = val.u32[1] & 0x80000000;
+ val.u32[1] = val.u32[1] & 0x7fffffff; /* remove the sign bit */
+
+ if((val.u32[1] & 0x7ff00000)== 0x7ff00000)/* x= nan or x = +-inf*/
+ return x+x;
+
+ if((val.u64 == 0x0000000000000000) || (n==0))
+ return x; /* x= +-0 or n= 0*/
+
+ exponent = val.u32[1] >> 20; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f64 = val.f64 * VAL_2PMULTIPLIER_DP;/*multiply by 2^53 to bring it to the normal range*/
+ exponent = val.u32[1] >> 20; /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_DP;
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+ __amd_handle_error(UNDERFLOW, ERANGE, "scalbn", x, (double) n , val.f64);
+
+ return val.f64;
+ }
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(OVERFLOW, ERANGE, "scalbn", x, (double) n , val.f64);
+
+ return val.f64;
+ }
+
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_DP)/*underflow*/
+ {
+ val.u32[1] = sign | 0x00000000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(UNDERFLOW, ERANGE, "scalbn", x, (double) n , val.f64);
+
+ return val.f64;
+ }
+
+    if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_DP;
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ val.f64 = val.f64 * VAL_2PMMULTIPLIER_DP;
+ return val.f64;
+ }
+
+ if(exponent > 2046)/*overflow*/
+ {
+ val.u32[1] = sign | 0x7ff00000;
+ val.u32[0] = 0x00000000;
+
+ __amd_handle_error(OVERFLOW, ERANGE, "scalbn", x, (double) n , val.f64);
+
+ return val.f64;
+ }
+
+ val.u32[1] = sign | (exponent << 20) | (val.u32[1] & 0x000fffff);
+ return val.f64;
+}
+
diff --git a/src/scalbnf.c b/src/scalbnf.c
new file mode 100644
index 0000000..1477fe1
--- /dev/null
+++ b/src/scalbnf.c
@@ -0,0 +1,138 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+
+#include <math.h>
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#include "../inc/libm_special.h"
+
+float FN_PROTOTYPE(scalbnf)(float x, int n)
+{
+ UT32 val;
+ unsigned int sign;
+ int exponent;
+ val.f32 = x;
+ sign = val.u32 & 0x80000000;
+ val.u32 = val.u32 & 0x7fffffff;/* remove the sign bit */
+
+ if((val.u32 & 0x7f800000)== 0x7f800000)/* x= nan or x = +-inf*/
+ return x+x;
+
+ if((val.u32 == 0x00000000) || (n==0))/* x= +-0 or n= 0*/
+ return x;
+
+ exponent = val.u32 >> 23; /* get the exponent */
+
+ if(exponent == 0)/*x is denormal*/
+ {
+ val.f32 = val.f32 * VAL_2PMULTIPLIER_SP;/*multiply by 2^24 to bring it to the normal range*/
+ exponent = (val.u32 >> 23); /* get the exponent */
+ exponent = exponent + n - MULTIPLIER_SP;
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float) n , 0, val.f32, 0);
+
+ }
+
+ return val.f32;
+ }
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float) n , 0, val.f32, 0);
+
+ }
+
+
+ return val.f32;
+ }
+
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ exponent += n;
+
+ if(exponent < -MULTIPLIER_SP)/*underflow*/
+ {
+ val.u32 = sign | 0x00000000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(UNDERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float) n , 0, val.f32, 0);
+
+ }
+
+
+ return val.f32;
+ }
+
+    if(exponent < 1)/*x is normal but output is denormal*/
+ {
+ exponent += MULTIPLIER_SP;
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);
+ val.f32 = val.f32 * VAL_2PMMULTIPLIER_SP;
+ return val.f32;
+ }
+
+ if(exponent > 254)/*overflow*/
+ {
+ val.u32 = sign | 0x7f800000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(OVERFLOW, ERANGE, "scalbnf", x, is_x_snan, (float) n , 0, val.f32, 0);
+
+ }
+
+ return val.f32;
+ }
+
+ val.u32 = sign | (exponent << 23) | (val.u32 & 0x007fffff);/*x is normal and output is normal*/
+ return val.f32;
+}
+
diff --git a/src/sincos_special.c b/src/sincos_special.c
new file mode 100644
index 0000000..c349d10
--- /dev/null
+++ b/src/sincos_special.c
@@ -0,0 +1,151 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include <emmintrin.h>
+#include <math.h>
+#include <errno.h>
+
+
+#include "../inc/libm_util_amd.h"
+#include "../inc/libm_special.h"
+
+double _sin_cos_special(double x, const char *name)
+{
+ UT64 xu;
+ unsigned int is_snan;
+
+ xu.f64 = x;
+
+ if((xu.u64 & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ // x is Inf or NaN
+ if((xu.u64 & MANTBITS_DP64) == 0x0)
+ {
+ // x is Inf
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#ifdef WIN64
+ xu.u64 = INDEFBITPATT_DP64;
+ __amd_handle_error(DOMAIN, EDOM, name, x, 0, xu.f64);
+#else
+ xu.u64 = QNANBITPATT_DP64;
+ name = *(&name); // dummy statement to avoid warning
+#endif
+ }
+ else {
+ // x is NaN
+ is_snan = (((xu.u64 & QNAN_MASK_64) == QNAN_MASK_64) ? 0 : 1);
+ if(is_snan){
+ xu.u64 |= QNAN_MASK_64;
+#ifdef WIN64
+#else
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#endif
+ }
+#ifdef WIN64
+ __amd_handle_error(DOMAIN, EDOM, name, x, 0, xu.f64);
+#endif
+ }
+
+ }
+
+ return xu.f64;
+}
+
+float _sinf_cosf_special(float x, const char *name)
+{
+ UT32 xu;
+ unsigned int is_snan;
+
+ xu.f32 = x;
+
+ if((xu.u32 & EXPBITS_SP32) == EXPBITS_SP32)
+ {
+ // x is Inf or NaN
+ if((xu.u32 & MANTBITS_SP32) == 0x0)
+ {
+ // x is Inf
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+#ifdef WIN64
+ xu.u32 = INDEFBITPATT_SP32;
+ __amd_handle_errorf(DOMAIN, EDOM, name, x, 0, 0.0f, 0, xu.f32, 0);
+#else
+ xu.u32 = QNANBITPATT_SP32;
+ name = *(&name); // dummy statement to avoid warning
+#endif
+ }
+ else {
+ // x is NaN
+ is_snan = (((xu.u32 & QNAN_MASK_32) == QNAN_MASK_32) ? 0 : 1);
+ if(is_snan) {
+ xu.u32 |= QNAN_MASK_32;
+ _mm_setcsr(_mm_getcsr() | MXCSR_ES_INVALID);
+ }
+#ifdef WIN64
+ __amd_handle_errorf(DOMAIN, EDOM, name, x, is_snan, 0.0f, 0, xu.f32, 0);
+#endif
+ }
+
+ }
+
+ return xu.f32;
+}
+
+float _sinf_special(float x)
+{
+ return _sinf_cosf_special(x, "sinf");
+}
+
+double _sin_special(double x)
+{
+ return _sin_cos_special(x, "sin");
+}
+
+float _cosf_special(float x)
+{
+ return _sinf_cosf_special(x, "cosf");
+}
+
+double _cos_special(double x)
+{
+ return _sin_cos_special(x, "cos");
+}
+
+void _sincosf_special(float x, float *sy, float *cy)
+{
+ float xu = _sinf_cosf_special(x, "sincosf");
+
+ *sy = xu;
+ *cy = xu;
+
+ return;
+}
+
+void _sincos_special(double x, double *sy, double *cy)
+{
+ double xu = _sin_cos_special(x, "sincos");
+
+ *sy = xu;
+ *cy = xu;
+
+ return;
+}
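
A small check of the behaviour these special-case handlers implement for infinite arguments: sin(+Inf) is a domain error, so the result must be a NaN and the invalid exception must be raised. The check below uses only standard C99 facilities and assumes the floating-point environment is accessible.

#include <fenv.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    feclearexcept(FE_ALL_EXCEPT);
    double y = sin(INFINITY);
    printf("sin(Inf) is NaN: %d, invalid raised: %d\n",
           isnan(y) != 0, fetestexcept(FE_INVALID) != 0);  /* expect 1, 1 */
    return 0;
}
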
diff --git a/src/sinh.c b/src/sinh.c
new file mode 100644
index 0000000..f22fee4
--- /dev/null
+++ b/src/sinh.c
@@ -0,0 +1,371 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_SCALEDOUBLE_2
+#define USE_INFINITY_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_HANDLE_ERROR
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_SCALEDOUBLE_2
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+
+/* Deal with errno for out-of-range result */
+static inline double retval_errno_erange(double x, int xneg)
+{
+ struct exception exc;
+ exc.arg1 = x;
+ exc.arg2 = x;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"sinh";
+ if (_LIB_VERSION == _SVID_)
+ {
+ if (xneg)
+ exc.retval = -HUGE;
+ else
+ exc.retval = HUGE;
+ }
+ else
+ {
+ if (xneg)
+ exc.retval = -infinity_with_flags(AMD_F_OVERFLOW);
+ else
+ exc.retval = infinity_with_flags(AMD_F_OVERFLOW);
+ }
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+double FN_PROTOTYPE(sinh)(double x)
+{
+ /*
+ After dealing with special cases the computation is split into
+ regions as follows:
+
+ abs(x) >= max_sinh_arg:
+ sinh(x) = sign(x)*Inf
+
+ abs(x) >= small_threshold:
+ sinh(x) = sign(x)*exp(abs(x))/2 computed using the
+ splitexp and scaleDouble functions as for exp_amd().
+
+ abs(x) < small_threshold:
+ compute p = exp(y) - 1 and then z = 0.5*(p+(p/(p+1.0)))
+ sinh(x) is then sign(x)*z. */
+
+ static const double
+ max_sinh_arg = 7.10475860073943977113e+02, /* 0x408633ce8fb9f87e */
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+ small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+    /* (8*BASEDIGITS_DP64*log10of2): above this threshold exp(-x) is negligible compared with exp(x) */
+
+ /* Lead and tail tabulated values of sinh(i) and cosh(i)
+ for i = 0,...,36. The lead part has 26 leading bits. */
+
+ static const double sinh_lead[37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.17520117759704589844e+00, /* 0x3ff2cd9fc0000000 */
+ 3.62686038017272949219e+00, /* 0x400d03cf60000000 */
+ 1.00178747177124023438e+01, /* 0x40240926e0000000 */
+ 2.72899169921875000000e+01, /* 0x403b4a3800000000 */
+ 7.42032089233398437500e+01, /* 0x40528d0160000000 */
+ 2.01713153839111328125e+02, /* 0x406936d228000000 */
+ 5.48316116333007812500e+02, /* 0x4081228768000000 */
+ 1.49047882080078125000e+03, /* 0x409749ea50000000 */
+ 4.05154187011718750000e+03, /* 0x40afa71570000000 */
+ 1.10132326660156250000e+04, /* 0x40c5829dc8000000 */
+ 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */
+ 8.13773945312500000000e+04, /* 0x40f3de1650000000 */
+ 2.21206695312500000000e+05, /* 0x410b00b590000000 */
+ 6.01302140625000000000e+05, /* 0x412259ac48000000 */
+ 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */
+ 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */
+ 1.20774762500000000000e+07, /* 0x4167093488000000 */
+ 3.28299845000000000000e+07, /* 0x417f4f2208000000 */
+ 8.92411500000000000000e+07, /* 0x419546d8f8000000 */
+ 2.42582596000000000000e+08, /* 0x41aceb0888000000 */
+ 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */
+ 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */
+ 4.87240166400000000000e+09, /* 0x41f226af30000000 */
+ 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */
+ 3.60024494080000000000e+10, /* 0x4220c3d390000000 */
+ 9.78648043520000000000e+10, /* 0x4236c93268000000 */
+ 2.66024116224000000000e+11, /* 0x424ef822f0000000 */
+ 7.23128516608000000000e+11, /* 0x42650bba30000000 */
+ 1.96566712320000000000e+12, /* 0x427c9aae40000000 */
+ 5.34323724288000000000e+12, /* 0x4293704708000000 */
+ 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */
+ 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */
+ 1.07321789251584000000e+14, /* 0x42d866f348000000 */
+ 2.91730863685632000000e+14, /* 0x42f0953e28000000 */
+ 7.93006722514944000000e+14, /* 0x430689e220000000 */
+ 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */
+
+ static const double sinh_tail[37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.60467555584448807892e-08, /* 0x3e513ae6096a0092 */
+ 2.76742892754807136947e-08, /* 0x3e5db70cfb79a640 */
+ 2.09697499555224576530e-07, /* 0x3e8c2526b66dc067 */
+ 2.04940252448908240062e-07, /* 0x3e8b81b18647f380 */
+ 1.65444891522700935932e-06, /* 0x3ebbc1cdd1e1eb08 */
+ 3.53116789999998198721e-06, /* 0x3ecd9f201534fb09 */
+ 6.94023870987375490695e-06, /* 0x3edd1c064a4e9954 */
+ 4.98876893611587449271e-06, /* 0x3ed4eca65d06ea74 */
+ 3.19656024605152215752e-05, /* 0x3f00c259bcc0ecc5 */
+ 2.08687768377236501204e-04, /* 0x3f2b5a6647cf9016 */
+ 4.84668088325403796299e-05, /* 0x3f09691adefb0870 */
+ 1.17517985422733832468e-03, /* 0x3f53410fc29cde38 */
+ 6.90830086959560562415e-04, /* 0x3f46a31a50b6fb3c */
+ 1.45697262451506548420e-03, /* 0x3f57defc71805c40 */
+ 2.99859023684906737806e-02, /* 0x3f9eb49fd80e0bab */
+ 1.02538800507941396667e-02, /* 0x3f84fffc7bcd5920 */
+ 1.26787628407699110022e-01, /* 0x3fc03a93b6c63435 */
+ 6.86652479544033744752e-02, /* 0x3fb1940bb255fd1c */
+ 4.81593627621056619148e-01, /* 0x3fded26e14260b50 */
+ 1.70489513795397629181e+00, /* 0x3ffb47401fc9f2a2 */
+ 1.12416073482258713767e+01, /* 0x40267bb3f55634f1 */
+ 7.06579578070110514432e+00, /* 0x401c435ff8194ddc */
+ 5.91244512999659974639e+01, /* 0x404d8fee052ba63a */
+ 1.68921736147050694399e+02, /* 0x40651d7edccde3f6 */
+ 2.60692936262073658327e+02, /* 0x40704b1644557d1a */
+ 3.62419382134885609048e+02, /* 0x4076a6b5ca0a9dc4 */
+ 4.07689930834187271103e+03, /* 0x40afd9cc72249aba */
+ 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */
+ 2.53720210371943067003e+04, /* 0x40d8c70158ac6363 */
+ 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */
+ 1.81871712615542812273e+05, /* 0x4106337db36fc718 */
+ 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */
+ 6.41374032312148716301e+05, /* 0x412392bc108b37cc */
+ 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */
+ 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */
+ 7.63580561355670914054e+06}; /* 0x415d20d76744835c */
+
+ static const double cosh_lead[37] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.54308062791824340820e+00, /* 0x3ff8b07550000000 */
+ 3.76219564676284790039e+00, /* 0x400e18fa08000000 */
+ 1.00676617622375488281e+01, /* 0x402422a490000000 */
+ 2.73082327842712402344e+01, /* 0x403b4ee858000000 */
+ 7.42099475860595703125e+01, /* 0x40528d6fc8000000 */
+ 2.01715633392333984375e+02, /* 0x406936e678000000 */
+ 5.48317031860351562500e+02, /* 0x4081228948000000 */
+ 1.49047915649414062500e+03, /* 0x409749eaa8000000 */
+ 4.05154199218750000000e+03, /* 0x40afa71580000000 */
+ 1.10132329101562500000e+04, /* 0x40c5829dd0000000 */
+ 2.99370708007812500000e+04, /* 0x40dd3c4488000000 */
+ 8.13773945312500000000e+04, /* 0x40f3de1650000000 */
+ 2.21206695312500000000e+05, /* 0x410b00b590000000 */
+ 6.01302140625000000000e+05, /* 0x412259ac48000000 */
+ 1.63450865625000000000e+06, /* 0x4138f0cca8000000 */
+ 4.44305525000000000000e+06, /* 0x4150f2ebd0000000 */
+ 1.20774762500000000000e+07, /* 0x4167093488000000 */
+ 3.28299845000000000000e+07, /* 0x417f4f2208000000 */
+ 8.92411500000000000000e+07, /* 0x419546d8f8000000 */
+ 2.42582596000000000000e+08, /* 0x41aceb0888000000 */
+ 6.59407856000000000000e+08, /* 0x41c3a6e1f8000000 */
+ 1.79245641600000000000e+09, /* 0x41dab5adb8000000 */
+ 4.87240166400000000000e+09, /* 0x41f226af30000000 */
+ 1.32445608960000000000e+10, /* 0x4208ab7fb0000000 */
+ 3.60024494080000000000e+10, /* 0x4220c3d390000000 */
+ 9.78648043520000000000e+10, /* 0x4236c93268000000 */
+ 2.66024116224000000000e+11, /* 0x424ef822f0000000 */
+ 7.23128516608000000000e+11, /* 0x42650bba30000000 */
+ 1.96566712320000000000e+12, /* 0x427c9aae40000000 */
+ 5.34323724288000000000e+12, /* 0x4293704708000000 */
+ 1.45244246507520000000e+13, /* 0x42aa6b7658000000 */
+ 3.94814795284480000000e+13, /* 0x42c1f43fc8000000 */
+ 1.07321789251584000000e+14, /* 0x42d866f348000000 */
+ 2.91730863685632000000e+14, /* 0x42f0953e28000000 */
+ 7.93006722514944000000e+14, /* 0x430689e220000000 */
+ 2.15561576592179200000e+15}; /* 0x431ea215a0000000 */
+
+ static const double cosh_tail[37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 6.89700037027478056904e-09, /* 0x3e3d9f5504c2bd28 */
+ 4.43207835591715833630e-08, /* 0x3e67cb66f0a4c9fd */
+ 2.33540217013828929694e-07, /* 0x3e8f58617928e588 */
+ 5.17452463948269748331e-08, /* 0x3e6bc7d000c38d48 */
+ 9.38728274131605919153e-07, /* 0x3eaf7f9d4e329998 */
+ 2.73012191010840495544e-06, /* 0x3ec6e6e464885269 */
+ 3.29486051438996307950e-06, /* 0x3ecba3a8b946c154 */
+ 4.75803746362771416375e-06, /* 0x3ed3f4e76110d5a4 */
+ 3.33050940471947692369e-05, /* 0x3f017622515a3e2b */
+ 9.94707313972136215365e-06, /* 0x3ee4dc4b528af3d0 */
+ 6.51685096227860253398e-05, /* 0x3f11156278615e10 */
+ 1.18132406658066663359e-03, /* 0x3f535ad50ed821f5 */
+ 6.93090416366541877541e-04, /* 0x3f46b61055f2935c */
+ 1.45780415323416845386e-03, /* 0x3f57e2794a601240 */
+ 2.99862082708111758744e-02, /* 0x3f9eb4b45f6aadd3 */
+ 1.02539925859688602072e-02, /* 0x3f85000b967b3698 */
+ 1.26787669807076286421e-01, /* 0x3fc03a940fadc092 */
+ 6.86652631843830962843e-02, /* 0x3fb1940bf3bf874c */
+ 4.81593633223853068159e-01, /* 0x3fded26e1a2a2110 */
+ 1.70489514001513020602e+00, /* 0x3ffb4740205796d6 */
+ 1.12416073489841270572e+01, /* 0x40267bb3f55cb85d */
+ 7.06579578098005001152e+00, /* 0x401c435ff81e18ac */
+ 5.91244513000686140458e+01, /* 0x404d8fee052bdea4 */
+ 1.68921736147088438429e+02, /* 0x40651d7edccde926 */
+ 2.60692936262087528121e+02, /* 0x40704b1644557e0e */
+ 3.62419382134890611269e+02, /* 0x4076a6b5ca0a9e1c */
+ 4.07689930834187453002e+03, /* 0x40afd9cc72249abe */
+ 1.55377375868385224749e+04, /* 0x40ce58de693edab5 */
+ 2.53720210371943103382e+04, /* 0x40d8c70158ac6364 */
+ 4.78822310734952334315e+04, /* 0x40e7614764f43e20 */
+ 1.81871712615542812273e+05, /* 0x4106337db36fc718 */
+ 5.62892347580489004031e+05, /* 0x41212d98b1f611e2 */
+ 6.41374032312148716301e+05, /* 0x412392bc108b37cc */
+ 7.57809544070145115256e+06, /* 0x415ce87bdc3473dc */
+ 3.64177136406482197344e+06, /* 0x414bc8d5ae99ad14 */
+ 7.63580561355670914054e+06}; /* 0x415d20d76744835c */
+
+ unsigned long long ux, aux, xneg;
+ double y, z, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3e30000000000000) /* |x| small enough that sinh(x) = x */
+ {
+ if (aux == 0)
+ /* with no inexact */
+ return x;
+ else
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (aux >= 0x7ff0000000000000) /* |x| is NaN or Inf */
+ {
+ return x + x;
+ }
+
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y >= max_sinh_arg)
+ {
+ /* Return +/-infinity with overflow flag */
+
+#ifdef WINDOWS
+ if (xneg)
+ return handle_error("sinh", NINFBITPATT_DP64, _OVERFLOW,
+                          AMD_F_OVERFLOW, ERANGE, x, 0.0F);
+ else
+ return handle_error("sinh", PINFBITPATT_DP64, _OVERFLOW,
+ AMD_F_OVERFLOW, ERANGE, x, 0.0F);
+#else
+ return retval_errno_erange(x, xneg);
+#endif
+ }
+ else if (y >= small_threshold)
+ {
+ /* In this range y is large enough so that
+ the negative exponential is negligible,
+ so sinh(y) is approximated by sign(x)*exp(y)/2. The
+ code below is an inlined version of that from
+ exp() with two changes (it operates on
+ y instead of x, and the division by 2 is
+ done by reducing m by 1). */
+
+ splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ m -= 1;
+
+ if (m >= EMIN_DP64 && m <= EMAX_DP64)
+ z = scaleDouble_1((z1+z2),m);
+ else
+ z = scaleDouble_2((z1+z2),m);
+ }
+ else
+ {
+ /* In this range we find the integer part y0 of y
+ and the increment dy = y - y0. We then compute
+
+ z = sinh(y) = sinh(y0)cosh(dy) + cosh(y0)sinh(dy)
+
+ where sinh(y0) and cosh(y0) are tabulated above. */
+
+ int ind;
+ double dy, dy2, sdy, cdy, sdy1, sdy2;
+
+ ind = (int)y;
+ dy = y - ind;
+
+ dy2 = dy*dy;
+ sdy = dy*dy2*(0.166666666666666667013899e0 +
+ (0.833333333333329931873097e-2 +
+ (0.198412698413242405162014e-3 +
+ (0.275573191913636406057211e-5 +
+ (0.250521176994133472333666e-7 +
+ (0.160576793121939886190847e-9 +
+ 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ cdy = dy2*(0.500000000000000005911074e0 +
+ (0.416666666666660876512776e-1 +
+ (0.138888888889814854814536e-2 +
+ (0.248015872460622433115785e-4 +
+ (0.275573350756016588011357e-6 +
+ (0.208744349831471353536305e-8 +
+ 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ /* At this point sinh(dy) is approximated by dy + sdy.
+ Shift some significant bits from dy to sdy. */
+
+ GET_BITS_DP64(dy, ux);
+ ux &= 0xfffffffff8000000;
+ PUT_BITS_DP64(ux, sdy1);
+ sdy2 = sdy + (dy - sdy1);
+
+ z = ((((((cosh_tail[ind]*sdy2 + sinh_tail[ind]*cdy)
+ + cosh_tail[ind]*sdy1) + sinh_tail[ind])
+ + cosh_lead[ind]*sdy2) + sinh_lead[ind]*cdy)
+ + cosh_lead[ind]*sdy1) + sinh_lead[ind];
+ }
+
+ if (xneg) z = - z;
+ return z;
+}
+
+weak_alias (__sinh, sinh)
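
A small check of the table-splitting identity used by the mid-range branch above, sinh(y0 + dy) = sinh(y0)*cosh(dy) + cosh(y0)*sinh(dy), evaluated with the standard libm functions rather than the tabulated lead/tail values.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double y = 5.3;
    double y0 = floor(y);          /* integer part, the table index */
    double dy = y - y0;            /* increment in [0,1) */
    double split = sinh(y0) * cosh(dy) + cosh(y0) * sinh(dy);
    printf("direct: %.17g\nsplit : %.17g\n", sinh(y), split);
    return 0;
}
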
diff --git a/src/sinhf.c b/src/sinhf.c
new file mode 100644
index 0000000..eaad0fd
--- /dev/null
+++ b/src/sinhf.c
@@ -0,0 +1,292 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_1
+#define USE_INFINITY_WITH_FLAGS
+#define USE_VALF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_1
+#undef USE_INFINITY_WITH_FLAGS
+#undef USE_VALF_WITH_FLAGS
+#undef USE_HANDLE_ERRORF
+
+#include "../inc/libm_errno_amd.h"
+
+#ifndef WINDOWS
+/* Deal with errno for out-of-range result */
+static inline float retval_errno_erange(float x, int xneg)
+{
+ struct exception exc;
+ exc.arg1 = (double)x;
+ exc.arg2 = (double)x;
+ exc.type = OVERFLOW;
+ exc.name = (char *)"sinhf";
+ if (_LIB_VERSION == _SVID_)
+ {
+ if (xneg)
+ exc.retval = -HUGE;
+ else
+ exc.retval = HUGE;
+ }
+ else
+ {
+ if (xneg)
+ exc.retval = -infinity_with_flags(AMD_F_OVERFLOW);
+ else
+ exc.retval = infinity_with_flags(AMD_F_OVERFLOW);
+ }
+ if (_LIB_VERSION == _POSIX_)
+ __set_errno(ERANGE);
+ else if (!matherr(&exc))
+ __set_errno(ERANGE);
+ return exc.retval;
+}
+#endif
+
+#ifdef WINDOWS
+#pragma function(sinhf)
+#endif
+
+float FN_PROTOTYPE(sinhf)(float fx)
+{
+ /*
+ After dealing with special cases the computation is split into
+ regions as follows:
+
+ abs(x) >= max_sinh_arg:
+ sinh(x) = sign(x)*Inf
+
+ abs(x) >= small_threshold:
+ sinh(x) = sign(x)*exp(abs(x))/2 computed using the
+ splitexp and scaleDouble functions as for exp_amd().
+
+ abs(x) < small_threshold:
+ compute p = exp(y) - 1 and then z = 0.5*(p+(p/(p+1.0)))
+ sinh(x) is then sign(x)*z. */
+
+ static const double
+ /* The max argument of sinhf, but stored as a double */
+ max_sinh_arg = 8.94159862922329438106e+01, /* 0x40565a9f84f82e63 */
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+ small_threshold = 8*BASEDIGITS_DP64*0.30102999566398119521373889;
+    /* (8*BASEDIGITS_DP64*log10of2): above this threshold exp(-x) is negligible compared with exp(x) */
+
+ /* Tabulated values of sinh(i) and cosh(i) for i = 0,...,36. */
+
+ static const double sinh_lead[37] = {
+ 0.00000000000000000000e+00, /* 0x0000000000000000 */
+ 1.17520119364380137839e+00, /* 0x3ff2cd9fc44eb982 */
+ 3.62686040784701857476e+00, /* 0x400d03cf63b6e19f */
+ 1.00178749274099008204e+01, /* 0x40240926e70949ad */
+ 2.72899171971277496596e+01, /* 0x403b4a3803703630 */
+ 7.42032105777887522891e+01, /* 0x40528d0166f07374 */
+ 2.01713157370279219549e+02, /* 0x406936d22f67c805 */
+ 5.48316123273246489589e+02, /* 0x408122876ba380c9 */
+ 1.49047882578955000099e+03, /* 0x409749ea514eca65 */
+ 4.05154190208278987484e+03, /* 0x40afa7157430966f */
+ 1.10132328747033916443e+04, /* 0x40c5829dced69991 */
+ 2.99370708492480553105e+04, /* 0x40dd3c4488cb48d6 */
+ 8.13773957064298447222e+04, /* 0x40f3de1654d043f0 */
+ 2.21206696003330085659e+05, /* 0x410b00b5916a31a5 */
+ 6.01302142081972560845e+05, /* 0x412259ac48bef7e3 */
+ 1.63450868623590236530e+06, /* 0x4138f0ccafad27f6 */
+ 4.44305526025387924165e+06, /* 0x4150f2ebd0a7ffe3 */
+ 1.20774763767876271158e+07, /* 0x416709348c0ea4ed */
+ 3.28299845686652474105e+07, /* 0x417f4f22091940bb */
+ 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */
+ 2.42582597704895108938e+08, /* 0x41aceb088b68e803 */
+ 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */
+ 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */
+ 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */
+ 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */
+ 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */
+ 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */
+ 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */
+ 7.23128532145737548828e+11, /* 0x42650bba3796379a */
+ 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */
+ 5.34323729076223046875e+12, /* 0x429370470aec28ec */
+ 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */
+ 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */
+ 1.07321789892958031250e+14, /* 0x42d866f34a725782 */
+ 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */
+ 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */
+ 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */
+
+ static const double cosh_lead[37] = {
+ 1.00000000000000000000e+00, /* 0x3ff0000000000000 */
+ 1.54308063481524371241e+00, /* 0x3ff8b07551d9f550 */
+ 3.76219569108363138810e+00, /* 0x400e18fa0df2d9bc */
+ 1.00676619957777653269e+01, /* 0x402422a497d6185e */
+ 2.73082328360164865444e+01, /* 0x403b4ee858de3e80 */
+ 7.42099485247878334349e+01, /* 0x40528d6fcbeff3a9 */
+ 2.01715636122455890700e+02, /* 0x406936e67db9b919 */
+ 5.48317035155212010977e+02, /* 0x4081228949ba3a8b */
+ 1.49047916125217807348e+03, /* 0x409749eaa93f4e76 */
+ 4.05154202549259389343e+03, /* 0x40afa715845d8894 */
+ 1.10132329201033226127e+04, /* 0x40c5829dd053712d */
+ 2.99370708659497577173e+04, /* 0x40dd3c4489115627 */
+ 8.13773957125740562333e+04, /* 0x40f3de1654d6b543 */
+ 2.21206696005590405548e+05, /* 0x410b00b5916b6105 */
+ 6.01302142082804115489e+05, /* 0x412259ac48bf13ca */
+ 1.63450868623620807193e+06, /* 0x4138f0ccafad2d17 */
+ 4.44305526025399193168e+06, /* 0x4150f2ebd0a8005c */
+ 1.20774763767876680940e+07, /* 0x416709348c0ea503 */
+ 3.28299845686652623117e+07, /* 0x417f4f22091940bf */
+ 8.92411504815936237574e+07, /* 0x419546d8f9ed26e1 */
+ 2.42582597704895138741e+08, /* 0x41aceb088b68e804 */
+ 6.59407867241607308388e+08, /* 0x41c3a6e1fd9eecfd */
+ 1.79245642306579566002e+09, /* 0x41dab5adb9c435ff */
+ 4.87240172312445068359e+09, /* 0x41f226af33b1fdc0 */
+ 1.32445610649217357635e+10, /* 0x4208ab7fb5475fb7 */
+ 3.60024496686929321289e+10, /* 0x4220c3d3920962c8 */
+ 9.78648047144193725586e+10, /* 0x4236c932696a6b5c */
+ 2.66024120300899291992e+11, /* 0x424ef822f7f6731c */
+ 7.23128532145737548828e+11, /* 0x42650bba3796379a */
+ 1.96566714857202099609e+12, /* 0x427c9aae4631c056 */
+ 5.34323729076223046875e+12, /* 0x429370470aec28ec */
+ 1.45244248326237109375e+13, /* 0x42aa6b765d8cdf6c */
+ 3.94814800913403437500e+13, /* 0x42c1f43fcc4b662c */
+ 1.07321789892958031250e+14, /* 0x42d866f34a725782 */
+ 2.91730871263727437500e+14, /* 0x42f0953e2f3a1ef7 */
+ 7.93006726156715250000e+14, /* 0x430689e221bc8d5a */
+ 2.15561577355759750000e+15}; /* 0x431ea215a1d20d76 */
+
+ unsigned long long ux, aux, xneg;
+ double x = fx, y, z, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3f10000000000000) /* |x| small enough that sinh(x) = x */
+ {
+ if (aux == 0)
+ /* with no inexact */
+ return fx;
+ else
+ return valf_with_flags(fx, AMD_F_INEXACT);
+ }
+ else if (aux >= 0x7ff0000000000000) /* |x| is NaN or Inf */
+ {
+#ifdef WINDOWS
+ if (aux > 0x7ff0000000000000)
+ {
+ /* x is NaN */
+ unsigned int uhx;
+ GET_BITS_SP32(fx, uhx);
+ return handle_errorf("sinhf", uhx|0x00400000, _DOMAIN,
+ AMD_F_INVALID, EDOM, fx, 0.0F);
+ }
+ else
+#endif
+ return fx + fx;
+ }
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y >= max_sinh_arg)
+ {
+ /* Return infinity with overflow flag. */
+#ifdef WINDOWS
+ if (xneg)
+ return handle_errorf("sinhf", NINFBITPATT_SP32, _OVERFLOW,
+ AMD_F_OVERFLOW, ERANGE, fx, 0.0F);
+ else
+ return handle_errorf("sinhf", PINFBITPATT_SP32, _OVERFLOW,
+ AMD_F_OVERFLOW, ERANGE, fx, 0.0F);
+#else
+ /* This handles POSIX behaviour */
+ __set_errno(ERANGE);
+ z = infinity_with_flags(AMD_F_OVERFLOW);
+#endif
+ }
+ else if (y >= small_threshold)
+ {
+ /* In this range y is large enough so that
+ the negative exponential is negligible,
+ so sinh(y) is approximated by sign(x)*exp(y)/2. The
+ code below is an inlined version of that from
+ exp() with two changes (it operates on
+ y instead of x, and the division by 2 is
+ done by reducing m by 1). */
+
+ splitexp(y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ m -= 1;
+ /* scaleDouble_1 is always safe because the argument x was
+ float, rather than double */
+ z = scaleDouble_1((z1+z2),m);
+ }
+ else
+ {
+ /* In this range we find the integer part y0 of y
+ and the increment dy = y - y0. We then compute
+
+ z = sinh(y) = sinh(y0)cosh(dy) + cosh(y0)sinh(dy)
+
+ where sinh(y0) and cosh(y0) are tabulated above. */
+
+ int ind;
+ double dy, dy2, sdy, cdy;
+
+ ind = (int)y;
+ dy = y - ind;
+
+ dy2 = dy*dy;
+
+ sdy = dy + dy*dy2*(0.166666666666666667013899e0 +
+ (0.833333333333329931873097e-2 +
+ (0.198412698413242405162014e-3 +
+ (0.275573191913636406057211e-5 +
+ (0.250521176994133472333666e-7 +
+ (0.160576793121939886190847e-9 +
+ 0.7746188980094184251527126e-12*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ cdy = 1 + dy2*(0.500000000000000005911074e0 +
+ (0.416666666666660876512776e-1 +
+ (0.138888888889814854814536e-2 +
+ (0.248015872460622433115785e-4 +
+ (0.275573350756016588011357e-6 +
+ (0.208744349831471353536305e-8 +
+ 0.1163921388172173692062032e-10*dy2)*dy2)*dy2)*dy2)*dy2)*dy2);
+
+ z = sinh_lead[ind]*cdy + cosh_lead[ind]*sdy;
+ }
+
+ if (xneg) z = - z;
+ return (float)z;
+}
+
+weak_alias (__sinhf, sinhf)
diff --git a/src/sqrt.c b/src/sqrt.c
new file mode 100644
index 0000000..14c5b1e
--- /dev/null
+++ b/src/sqrt.c
@@ -0,0 +1,65 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include <emmintrin.h>
+#include <math.h>
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+#pragma function(sqrt)
+#endif
+/* SSE2 provides the instruction SQRTSD, which computes the square root
+   of the low-order double-precision floating-point value in an XMM register
+   or in a 64-bit memory location and writes the result to the low-order quadword
+   of another XMM register. The corresponding intrinsic is _mm_sqrt_sd(). */
+double FN_PROTOTYPE(sqrt)(double x)
+{
+ __m128d X128;
+ double result;
+ UT64 uresult;
+
+ if(x < 0.0)
+ {
+ uresult.u64 = 0xfff8000000000000;
+ __amd_handle_error(DOMAIN, EDOM, "sqrt", x, 0.0 , uresult.f64);
+ return uresult.f64;
+ }
+ /*Load x into an XMM register*/
+ X128 = _mm_load_sd(&x);
+    /*Calculate sqrt using the SQRTSD instruction*/
+ X128 = _mm_sqrt_sd(X128, X128);
+ /*Store back the result into a double precision floating point number*/
+ _mm_store_sd(&result, X128);
+ return result;
+}
+
+
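
A minimal sketch of the SQRTSD path above using the same SSE2 intrinsics, with the negative-argument error handling left out.

#include <emmintrin.h>
#include <stdio.h>

static double sqrt_sse2(double x)
{
    __m128d v = _mm_load_sd(&x);   /* load x into the low lane */
    v = _mm_sqrt_sd(v, v);         /* SQRTSD on the low lane */
    double r;
    _mm_store_sd(&r, v);
    return r;
}

int main(void)
{
    printf("%.15f\n", sqrt_sse2(2.0));   /* 1.414213562373095... */
    return 0;
}
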
diff --git a/src/sqrtf.c b/src/sqrtf.c
new file mode 100644
index 0000000..48e53cd
--- /dev/null
+++ b/src/sqrtf.c
@@ -0,0 +1,73 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include <emmintrin.h>
+#include <math.h>
+#ifdef WIN64
+#include <fpieee.h>
+#else
+#include <errno.h>
+#endif
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+#include "../inc/libm_special.h"
+
+#ifdef WINDOWS
+#pragma function(sqrtf)
+#endif
+/*SSE2 provides the SQRTSS instruction, which computes the square root
+ of the low-order single-precision floating-point value in an XMM register
+ or in a 32-bit memory location and writes the result to the low-order doubleword
+ of another XMM register. The corresponding intrinsic is _mm_sqrt_ss().*/
+float FN_PROTOTYPE(sqrtf)(float x)
+{
+ __m128 X128;
+ float result;
+ UT32 uresult;
+
+ if(x < 0.0)
+ {
+ uresult.u32 = 0xffc00000;
+
+ {
+ unsigned int is_x_snan;
+ UT32 xm; xm.f32 = x;
+ is_x_snan = ( ((xm.u32 & QNAN_MASK_32) == 0) ? 1 : 0 );
+ __amd_handle_errorf(DOMAIN, EDOM, "sqrtf", x, is_x_snan, 0.0f, 0, uresult.f32, 0);
+ }
+
+ return uresult.f32;
+ }
+
+ /*Load x into an XMM register*/
+ X128 = _mm_load_ss(&x);
+ /*Calculate sqrt using the SQRTSS instruction*/
+ X128 = _mm_sqrt_ss(X128);
+ /*Store back the result into a single precision floating point number*/
+ _mm_store_ss(&result, X128);
+ return result;
+}
+
+
diff --git a/src/tan.c b/src/tan.c
new file mode 100644
index 0000000..a7fe651
--- /dev/null
+++ b/src/tan.c
@@ -0,0 +1,260 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+#define USE_NAN_WITH_FLAGS
+#define USE_VAL_WITH_FLAGS
+#define USE_HANDLE_ERROR
+#include "../inc/libm_inlines_amd.h"
+#undef USE_NAN_WITH_FLAGS
+#undef USE_VAL_WITH_FLAGS
+#undef USE_HANDLE_ERROR
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#endif
+
+extern void __amd_remainder_piby2(double x, double *r, double *rr, int *region);
+
+/* tan(x + xx) approximation valid on the interval [-pi/4,pi/4].
+ If recip is true return -1/tan(x + xx) instead. */
+static inline double tan_piby4(double x, double xx, int recip)
+{
+ double r, t1, t2, xl;
+ int transform = 0;
+ static const double
+ piby4_lead = 7.85398163397448278999e-01, /* 0x3fe921fb54442d18 */
+ piby4_tail = 3.06161699786838240164e-17; /* 0x3c81a62633145c06 */
+
+ /* In order to maintain relative precision, transform using the identity
+ tan(pi/4 - x) = (1 - tan(x))/(1 + tan(x)) for arguments close to pi/4.
+ Similarly use tan(x - pi/4) = (tan(x) - 1)/(tan(x) + 1) close to -pi/4. */
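+ /* The first identity follows from the angle-difference formula:
+    tan(pi/4 - x) = (tan(pi/4) - tan(x))/(1 + tan(pi/4)*tan(x))
+                  = (1 - tan(x))/(1 + tan(x)),
+    since tan(pi/4) = 1; the second is obtained by negating both sides. */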
+
+ if (x > 0.68)
+ {
+ transform = 1;
+ x = piby4_lead - x;
+ xl = piby4_tail - xx;
+ x += xl;
+ xx = 0.0;
+ }
+ else if (x < -0.68)
+ {
+ transform = -1;
+ x = piby4_lead + x;
+ xl = piby4_tail + xx;
+ x += xl;
+ xx = 0.0;
+ }
+
+ /* Core Remez [2,3] approximation to tan(x+xx) on the
+ interval [0,0.68]. */
+
+ r = x*x + 2.0 * x * xx;
+ t1 = x;
+ t2 = xx + x*r*
+ (0.372379159759792203640806338901e0 +
+ (-0.229345080057565662883358588111e-1 +
+ 0.224044448537022097264602535574e-3*r)*r)/
+ (0.111713747927937668539901657944e1 +
+ (-0.515658515729031149329237816945e0 +
+ (0.260656620398645407524064091208e-1 -
+ 0.232371494088563558304549252913e-3*r)*r)*r);
+
+ /* Reconstruct tan(x) in the transformed case. */
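+ /* Here t ~= tan(pi/4 - |original argument|) and transform carries the sign:
+      tan(original)    = transform*(1 - t)/(1 + t) = transform*(1.0 - 2*t/(1 + t))
+      -1/tan(original) = transform*(1 + t)/(t - 1) = transform*(2*t/(t - 1) - 1.0) */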
+
+ if (transform)
+ {
+ double t;
+ t = t1 + t2;
+ if (recip)
+ return transform*(2*t/(t-1) - 1.0);
+ else
+ return transform*(1.0 - 2*t/(1+t));
+ }
+
+ if (recip)
+ {
+ /* Compute -1.0/(t1 + t2) accurately */
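+ /* Split t into a high part z1 (low 32 bits of the mantissa cleared) and
+    a tail z2 with z1 + z2 = t1 + t2, and let trec_top be a similarly
+    truncated copy of the first approximation trec ~= -1.0/t. Then
+      -1/t = trec_top + (-1/t)*(1 + trec_top*t)
+           ~= trec_top + trec*((1.0 + trec_top*z1) + trec_top*z2),
+    where trec_top*z1 is exact because both factors have short mantissas. */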
+ double trec, trec_top, z1, z2, t;
+ unsigned long long u;
+ t = t1 + t2;
+ GET_BITS_DP64(t, u);
+ u &= 0xffffffff00000000;
+ PUT_BITS_DP64(u, z1);
+ z2 = t2 - (z1 - t1);
+ trec = -1.0 / t;
+ GET_BITS_DP64(trec, u);
+ u &= 0xffffffff00000000;
+ PUT_BITS_DP64(u, trec_top);
+ return trec_top + trec * ((1.0 + trec_top * z1) + trec_top * z2);
+
+ }
+ else
+ return t1 + t2;
+}
+
+#ifdef WINDOWS
+#pragma function(tan)
+#endif
+
+double FN_PROTOTYPE(tan)(double x)
+{
+ double r, rr;
+ int region, xneg;
+
+ unsigned long long ux, ax;
+ GET_BITS_DP64(x, ux);
+ ax = (ux & ~SIGNBIT_DP64);
+ if (ax <= 0x3fe921fb54442d18) /* abs(x) <= pi/4 */
+ {
+ if (ax < 0x3f20000000000000) /* abs(x) < 2.0^(-13) */
+ {
+ if (ax < 0x3e40000000000000) /* abs(x) < 2.0^(-27) */
+ {
+ if (ax == 0x0000000000000000) return x;
+ else return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else
+ {
+#ifdef WINDOWS
+ /* Using a temporary variable prevents 64-bit VC++ from
+ rearranging
+ x + x*x*x*0.333333333333333333;
+ into
+ x * (1 + x*x*0.333333333333333333);
+ The latter results in an incorrectly rounded answer. */
+ double tmp;
+ tmp = x*x*x*0.333333333333333333;
+ return x + tmp;
+#else
+ return x + x*x*x*0.333333333333333333;
+#endif
+ }
+ }
+ else
+ return tan_piby4(x, 0.0, 0);
+ }
+ else if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ /* x is NaN */
+#ifdef WINDOWS
+ return handle_error("tan", ux|0x0008000000000000, _DOMAIN, 0,
+ EDOM, x, 0.0);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ else
+ /* x is infinity. Return a NaN */
+#ifdef WINDOWS
+ return handle_error("tan", INDEFBITPATT_DP64, _DOMAIN, 0,
+ EDOM, x, 0.0);
+#else
+ return nan_with_flags(AMD_F_INVALID);
+#endif
+ }
+ xneg = (ax != ux);
+
+
+ if (xneg)
+ x = -x;
+
+ if (x < 5.0e5)
+ {
+ /* For arguments of this size we can just carefully subtract the
+ appropriate multiple of pi/2, using extra precision where
+ x is close to an exact multiple of pi/2 */
+ static const double
+ twobypi = 6.36619772367581382433e-01, /* 0x3fe45f306dc9c883 */
+ piby2_1 = 1.57079632673412561417e+00, /* 0x3ff921fb54400000 */
+ piby2_1tail = 6.07710050650619224932e-11, /* 0x3dd0b4611a626331 */
+ piby2_2 = 6.07710050630396597660e-11, /* 0x3dd0b4611a600000 */
+ piby2_2tail = 2.02226624879595063154e-21, /* 0x3ba3198a2e037073 */
+ piby2_3 = 2.02226624871116645580e-21, /* 0x3ba3198a2e000000 */
+ piby2_3tail = 8.47842766036889956997e-32; /* 0x397b839a252049c1 */
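+ /* Cody-Waite style constants: piby2_1 + piby2_2 + piby2_3 ~= pi/2, with
+    each piby2_k truncated so that npi2*piby2_k is exact for the npi2
+    values that can occur here, and each _tail value holding the next
+    bits of pi/2 beyond the corresponding truncation. */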
+ double t, rhead, rtail;
+ int npi2;
+ unsigned long long uy, xexp, expdiff;
+ xexp = ax >> EXPSHIFTBITS_DP64;
+ /* How many pi/2 is x a multiple of? */
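+ /* For the smallest arguments, choose npi2 directly from comparisons
+    against the odd multiples 3pi/4 ... 9pi/4; for larger x round
+    x*(2/pi) to the nearest integer. */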
+ if (ax <= 0x400f6a7a2955385e) /* 5pi/4 */
+ {
+ if (ax <= 0x4002d97c7f3321d2) /* 3pi/4 */
+ npi2 = 1;
+ else
+ npi2 = 2;
+ }
+ else if (ax <= 0x401c463abeccb2bb) /* 9pi/4 */
+ {
+ if (ax <= 0x4015fdbbe9bba775) /* 7pi/4 */
+ npi2 = 3;
+ else
+ npi2 = 4;
+ }
+ else
+ npi2 = (int)(x * twobypi + 0.5);
+ /* Subtract the multiple from x to get an extra-precision remainder */
+ rhead = x - npi2 * piby2_1;
+ rtail = npi2 * piby2_1tail;
+ GET_BITS_DP64(rhead, uy);
+ expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ if (expdiff > 15)
+ {
+ /* The remainder is pretty small compared with x, which
+ implies that x is a near multiple of pi/2
+ (x matches the multiple to at least 15 bits) */
+ t = rhead;
+ rtail = npi2 * piby2_2;
+ rhead = t - rtail;
+ rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ if (expdiff > 48)
+ {
+ /* x matches a pi/2 multiple to at least 48 bits */
+ t = rhead;
+ rtail = npi2 * piby2_3;
+ rhead = t - rtail;
+ rtail = npi2 * piby2_3tail - ((t - rhead) - rtail);
+ }
+ }
+ r = rhead - rtail;
+ rr = (rhead - r) - rtail;
+ region = npi2 & 3;
+ }
+ else
+ {
+ /* Reduce x into range [-pi/4,pi/4] */
+ __amd_remainder_piby2(x, &r, &rr, &region);
+ /* __remainder_piby2(x, &r, &rr, &region);*/
+ }
+
+ if (xneg)
+ return -tan_piby4(r, rr, region & 1);
+ else
+ return tan_piby4(r, rr, region & 1);
+}
+
+weak_alias (__tan, tan)
diff --git a/src/tanf.c b/src/tanf.c
new file mode 100644
index 0000000..856cdcf
--- /dev/null
+++ b/src/tanf.c
@@ -0,0 +1,203 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+/*#define USE_REMAINDER_PIBY2F_INLINE*/
+#define USE_VALF_WITH_FLAGS
+#define USE_NANF_WITH_FLAGS
+#define USE_HANDLE_ERRORF
+#include "../inc/libm_inlines_amd.h"
+#undef USE_VALF_WITH_FLAGS
+#undef USE_NANF_WITH_FLAGS
+/*#undef USE_REMAINDER_PIBY2F_INLINE*/
+#undef USE_HANDLE_ERRORF
+
+#ifdef WINDOWS
+#include "../inc/libm_errno_amd.h"
+#endif
+
+extern void __amd_remainder_piby2d2f(unsigned long long ux, double *r, int *region);
+
+/* tan(x) approximation valid on the interval [-pi/4,pi/4].
+ If recip is true return -1/tan(x) instead. */
+static inline double tanf_piby4(double x, int recip)
+{
+ double r, t;
+
+ /* Core Remez [1,2] approximation to tan(x) on the
+ interval [0,pi/4]. */
+ r = x*x;
+ t = x + x*r*
+ (0.385296071263995406715129e0 -
+ 0.172032480471481694693109e-1 * r) /
+ (0.115588821434688393452299e+1 +
+ (-0.51396505478854532132342e0 +
+ 0.1844239256901656082986661e-1 * r) * r);
+
+ if (recip)
+ return -1.0 / t;
+ else
+ return t;
+}
+
+#ifdef WINDOWS
+#pragma function(tanf)
+#endif
+
+float FN_PROTOTYPE(tanf)(float x)
+{
+ double r, dx;
+ int region, xneg;
+
+ unsigned long long ux, ax;
+
+ dx = x;
+
+ GET_BITS_DP64(dx, ux);
+ ax = (ux & ~SIGNBIT_DP64);
+
+ if (ax <= 0x3fe921fb54442d18LL) /* abs(x) <= pi/4 */
+ {
+ if (ax < 0x3f80000000000000LL) /* abs(x) < 2.0^(-7) */
+ {
+ if (ax < 0x3f20000000000000LL) /* abs(x) < 2.0^(-13) */
+ {
+ if (ax == 0x0000000000000000LL)
+ return x;
+ else
+ return valf_with_flags(x, AMD_F_INEXACT);
+ }
+ else
+ return (float)(dx + dx*dx*dx*0.333333333333333333);
+ }
+ else
+ return (float)tanf_piby4(x, 0);
+ }
+ else if ((ux & EXPBITS_DP64) == EXPBITS_DP64)
+ {
+ /* x is either NaN or infinity */
+ if (ux & MANTBITS_DP64)
+ {
+ /* x is NaN */
+#ifdef WINDOWS
+ unsigned int ufx;
+ GET_BITS_SP32(x, ufx);
+ return handle_errorf("tanf", ufx|0x00400000, _DOMAIN, 0,
+ EDOM, x, 0.0F);
+#else
+ return x + x; /* Raise invalid if it is a signalling NaN */
+#endif
+ }
+ else
+ {
+ /* x is infinity. Return a NaN */
+#ifdef WINDOWS
+ return handle_errorf("tanf", INDEFBITPATT_SP32, _DOMAIN, 0,
+ EDOM, x, 0.0F);
+#else
+ return nanf_with_flags(AMD_F_INVALID);
+#endif
+ }
+ }
+
+ xneg = (int)(ux >> 63);
+
+ if (xneg)
+ dx = -dx;
+
+ if (dx < 5.0e5)
+ {
+ /* For arguments of this size we can just carefully subtract the
+ appropriate multiple of pi/2, using extra precision where
+ dx is close to an exact multiple of pi/2 */
+ static const double
+ twobypi = 6.36619772367581382433e-01, /* 0x3fe45f306dc9c883 */
+ piby2_1 = 1.57079632673412561417e+00, /* 0x3ff921fb54400000 */
+ piby2_1tail = 6.07710050650619224932e-11, /* 0x3dd0b4611a626331 */
+ piby2_2 = 6.07710050630396597660e-11, /* 0x3dd0b4611a600000 */
+ piby2_2tail = 2.02226624879595063154e-21, /* 0x3ba3198a2e037073 */
+ piby2_3 = 2.02226624871116645580e-21, /* 0x3ba3198a2e000000 */
+ piby2_3tail = 8.47842766036889956997e-32; /* 0x397b839a252049c1 */
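+ /* Same split of pi/2 into exactly-representable pieces plus tails as in tan.c. */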
+ double t, rhead, rtail;
+ int npi2;
+ unsigned long long uy, xexp, expdiff;
+ xexp = ax >> EXPSHIFTBITS_DP64;
+ /* How many pi/2 is dx a multiple of? */
+ if (ax <= 0x400f6a7a2955385eLL) /* 5pi/4 */
+ {
+ if (ax <= 0x4002d97c7f3321d2LL) /* 3pi/4 */
+ npi2 = 1;
+ else
+ npi2 = 2;
+ }
+ else if (ax <= 0x401c463abeccb2bbLL) /* 9pi/4 */
+ {
+ if (ax <= 0x4015fdbbe9bba775LL) /* 7pi/4 */
+ npi2 = 3;
+ else
+ npi2 = 4;
+ }
+ else
+ npi2 = (int)(dx * twobypi + 0.5);
+ /* Subtract the multiple from dx to get an extra-precision remainder */
+ rhead = dx - npi2 * piby2_1;
+ rtail = npi2 * piby2_1tail;
+ GET_BITS_DP64(rhead, uy);
+ expdiff = xexp - ((uy & EXPBITS_DP64) >> EXPSHIFTBITS_DP64);
+ if (expdiff > 15)
+ {
+ /* The remainder is pretty small compared with dx, which
+ implies that dx is a near multiple of pi/2
+ (dx matches the multiple to at least 15 bits) */
+ t = rhead;
+ rtail = npi2 * piby2_2;
+ rhead = t - rtail;
+ rtail = npi2 * piby2_2tail - ((t - rhead) - rtail);
+ if (expdiff > 48)
+ {
+ /* dx matches a pi/2 multiple to at least 48 bits */
+ t = rhead;
+ rtail = npi2 * piby2_3;
+ rhead = t - rtail;
+ rtail = npi2 * piby2_3tail - ((t - rhead) - rtail);
+ }
+ }
+ r = rhead - rtail;
+ region = npi2 & 3;
+ }
+ else
+ {
+ /* Reduce x into range [-pi/4,pi/4] */
+ __amd_remainder_piby2d2f(ax, &r, &region);
+ }
+
+ if (xneg)
+ return (float)-tanf_piby4(r, region & 1);
+ else
+ return (float)tanf_piby4(r, region & 1);
+}
+
+weak_alias (__tanf, tanf)
diff --git a/src/tanh.c b/src/tanh.c
new file mode 100644
index 0000000..ead758b
--- /dev/null
+++ b/src/tanh.c
@@ -0,0 +1,129 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+#define USE_SPLITEXP
+#define USE_SCALEDOUBLE_2
+#define USE_VAL_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SPLITEXP
+#undef USE_SCALEDOUBLE_2
+#undef USE_VAL_WITH_FLAGS
+
+double FN_PROTOTYPE(tanh)(double x)
+{
+ /*
+ The definition of tanh(x) is sinh(x)/cosh(x), which is also equivalent
+ to the following three formulae:
+ 1. (exp(x) - exp(-x))/(exp(x) + exp(-x))
+ 2. (1 - (2/(exp(2*x) + 1 )))
+ 3. (exp(2*x) - 1)/(exp(2*x) + 1)
+ but computationally, some formulae are better on some ranges.
+ */
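+ /* Below, tanh(x) = x is used for very small |x|, rational (Remez)
+    approximations for |x| <= 1, formula 2 for 1 < |x| <= 20 (the
+    large_threshold), and +-1 for larger |x|, where exp(-2*|x|) is
+    negligible. */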
+ static const double
+ thirtytwo_by_log2 = 4.61662413084468283841e+01, /* 0x40471547652b82fe */
+ log2_by_32_lead = 2.16608493356034159660e-02, /* 0x3f962e42fe000000 */
+ log2_by_32_tail = 5.68948749532545630390e-11, /* 0x3dcf473de6af278e */
+ large_threshold = 20.0; /* 0x4034000000000000 */
+
+ unsigned long long ux, aux, xneg;
+ double y, z, p, z1, z2;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_DP64(x, ux);
+ aux = ux & ~SIGNBIT_DP64;
+ if (aux < 0x3e30000000000000) /* |x| small enough that tanh(x) = x */
+ {
+ if (aux == 0)
+ return x; /* with no inexact */
+ else
+ return val_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (aux > 0x7ff0000000000000) /* |x| is NaN */
+ return x + x;
+
+ xneg = (aux != ux);
+
+ y = x;
+ if (xneg) y = -x;
+
+ if (y > large_threshold)
+ {
+ /* If x is large then exp(-x) is negligible and
+ formula 1 reduces to plus or minus 1.0 */
+ z = 1.0;
+ }
+ else if (y <= 1.0)
+ {
+ double y2;
+ y2 = y*y;
+ if (y < 0.9)
+ {
+ /* Use a [3,3] Remez approximation on [0,0.9]. */
+ z = y + y*y2*
+ (-0.274030424656179760118928e0 +
+ (-0.176016349003044679402273e-1 +
+ (-0.200047621071909498730453e-3 -
+ 0.142077926378834722618091e-7*y2)*y2)*y2)/
+ (0.822091273968539282568011e0 +
+ (0.381641414288328849317962e0 +
+ (0.201562166026937652780575e-1 +
+ 0.2091140262529164482568557e-3*y2)*y2)*y2);
+ }
+ else
+ {
+ /* Use a [3,3] Remez approximation on [0.9,1]. */
+ z = y + y*y2*
+ (-0.227793870659088295252442e0 +
+ (-0.146173047288731678404066e-1 +
+ (-0.165597043903549960486816e-3 -
+ 0.115475878996143396378318e-7*y2)*y2)*y2)/
+ (0.683381611977295894959554e0 +
+ (0.317204558977294374244770e0 +
+ (0.167358775461896562588695e-1 +
+ 0.173076050126225961768710e-3*y2)*y2)*y2);
+ }
+ }
+ else
+ {
+ /* Compute p = exp(2*y) + 1. The code is basically inlined
+ from exp_amd. */
+
+ splitexp(2*y, 1.0, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ p = scaleDouble_2(z1 + z2, m) + 1.0;
+
+ /* Now reconstruct tanh from p. */
+ z = (1.0 - 2.0/p);
+ }
+
+ if (xneg) z = - z;
+ return z;
+}
+
+weak_alias (__tanh, tanh)
diff --git a/src/tanhf.c b/src/tanhf.c
new file mode 100644
index 0000000..1cb14c4
--- /dev/null
+++ b/src/tanhf.c
@@ -0,0 +1,126 @@
+
+/*
+* Copyright (C) 2008-2009 Advanced Micro Devices, Inc. All Rights Reserved.
+*
+* This file is part of libacml_mv.
+*
+* libacml_mv is free software; you can redistribute it and/or
+* modify it under the terms of the GNU Lesser General Public
+* License as published by the Free Software Foundation; either
+* version 2.1 of the License, or (at your option) any later version.
+*
+* libacml_mv is distributed in the hope that it will be useful,
+* but WITHOUT ANY WARRANTY; without even the implied warranty of
+* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
+* Lesser General Public License for more details.
+*
+* You should have received a copy of the GNU Lesser General Public
+* License along with libacml_mv. If not, see
+* <http://www.gnu.org/licenses/>.
+*
+*/
+
+
+
+#include "../inc/libm_amd.h"
+#include "../inc/libm_util_amd.h"
+
+
+
+#define USE_SPLITEXPF
+#define USE_SCALEFLOAT_2
+#define USE_VALF_WITH_FLAGS
+#include "../inc/libm_inlines_amd.h"
+#undef USE_SPLITEXPF
+#undef USE_SCALEFLOAT_2
+#undef USE_VALF_WITH_FLAGS
+
+#include "../inc/libm_errno_amd.h"
+
+float FN_PROTOTYPE(tanhf)(float x)
+{
+ /*
+ The definition of tanh(x) is sinh(x)/cosh(x), which is also equivalent
+ to the following three formulae:
+ 1. (exp(x) - exp(-x))/(exp(x) + exp(-x))
+ 2. (1 - (2/(exp(2*x) + 1 )))
+ 3. (exp(2*x) - 1)/(exp(2*x) + 1)
+ but computationally, some formulae are better on some ranges.
+ */
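+ /* Below, tanhf(x) = x is used for very small |x|, rational (Remez)
+    approximations for |x| <= 1, formula 2 for 1 < |x| <= 10 (the
+    large_threshold), and +-1 for larger |x|. */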
+ static const float
+ thirtytwo_by_log2 = 4.6166240692e+01F, /* 0x4238aa3b */
+ log2_by_32_lead = 2.1659851074e-02F, /* 0x3cb17000 */
+ log2_by_32_tail = 9.9831822808e-07F, /* 0x3585fdf4 */
+ large_threshold = 10.0F; /* 0x41200000 */
+
+ unsigned int ux, aux;
+ float y, z, p, z1, z2, xneg;
+ int m;
+
+ /* Special cases */
+
+ GET_BITS_SP32(x, ux);
+ aux = ux & ~SIGNBIT_SP32;
+ if (aux < 0x39000000) /* |x| small enough that tanh(x) = x */
+ {
+ if (aux == 0)
+ return x; /* with no inexact */
+ else
+ return valf_with_flags(x, AMD_F_INEXACT);
+ }
+ else if (aux > 0x7f800000) /* |x| is NaN */
+ return x + x;
+
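+ /* aux != ux exactly when the sign bit of x is set, so xneg is
+    -1.0F for negative x and +1.0F otherwise. */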
+ xneg = 1.0F - 2.0F * (aux != ux);
+
+ y = xneg * x;
+
+ if (y > large_threshold)
+ {
+ /* If x is large then exp(-x) is negligible and
+ formula 1 reduces to plus or minus 1.0 */
+ z = 1.0F;
+ }
+ else if (y <= 1.0F)
+ {
+ float y2;
+ y2 = y*y;
+
+ if (y < 0.9F)
+ {
+ /* Use a [2,1] Remez approximation on [0,0.9]. */
+ z = y + y*y2*
+ (-0.28192806108402678e0F +
+ (-0.14628356048797849e-2F +
+ 0.4891631088530669873e-4F*y2)*y2)/
+ (0.845784192581041099e0F +
+ 0.3427017942262751343e0F*y2);
+ }
+ else
+ {
+ /* Use a [2,1] Remez approximation on [0.9,1]. */
+ z = y + y*y2*
+ (-0.24069858695196524e0F +
+ (-0.12325644183611929e-2F +
+ 0.3827534993599483396e-4F*y2)*y2)/
+ (0.72209738473684982e0F +
+ 0.292529068698052819e0F*y2);
+ }
+ }
+ else
+ {
+ /* Compute p = exp(2*y) + 1. The code is basically inlined
+ from exp_amd. */
+
+ splitexpf(2*y, 1.0F, thirtytwo_by_log2, log2_by_32_lead,
+ log2_by_32_tail, &m, &z1, &z2);
+ p = scaleFloat_2(z1 + z2, m) + 1.0F;
+ /* Now reconstruct tanh from p. */
+ z = (1.0F - 2.0F/p);
+ }
+
+ return xneg * z;
+}
+
+
+weak_alias (__tanhf, tanhf)
diff --git a/testdata/exp.rephil_docs.builtin.baseline.trace b/testdata/exp.rephil_docs.builtin.baseline.trace
new file mode 100644
index 0000000..8344f12
--- /dev/null
+++ b/testdata/exp.rephil_docs.builtin.baseline.trace
Binary files differ
diff --git a/testdata/expf.fastmath_unittest.trace b/testdata/expf.fastmath_unittest.trace
new file mode 100644
index 0000000..c867b36
--- /dev/null
+++ b/testdata/expf.fastmath_unittest.trace
Binary files differ
diff --git a/testdata/log.rephil_docs.builtin.baseline.trace b/testdata/log.rephil_docs.builtin.baseline.trace
new file mode 100644
index 0000000..e87d631
--- /dev/null
+++ b/testdata/log.rephil_docs.builtin.baseline.trace
Binary files differ
diff --git a/testdata/notes.txt b/testdata/notes.txt
new file mode 100644
index 0000000..8b5884f
--- /dev/null
+++ b/testdata/notes.txt
@@ -0,0 +1,23 @@
+The traces in this directory are used for validating the math library
+and for testing its performance. Each file contains the input
+arguments to the named math function, written in raw binary format.
+
+exp, log and pow were collected from the Perflab benchmark
+compiler/rephil/docs/v7, and expf was collected from
+util/math:fastmath_unittest.
+
+The traces were collected by linking in a small library that wrote
+the first 4M arguments to a file before returning the actual value
+(a minimal sketch of such a wrapper appears at the end of this note).
+ - The library was added as a dep to "base:base".
+ - To avoid writing samples for genrules, the profiling was guarded by
+   a macro that was defined using --copt.
+ - Tcmalloc holds a lock while it calls log(), so care had to be taken
+   not to cause a deadlock when profiling log().
+   For the other functions, the actual value could be calculated
+   using something like this:
+     _exp = (double (*)(double)) dlsym(RTLD_NEXT, "exp");
+     return _exp(x);
+   For log(), we made the following call instead:
+     return log10(x)/log10(2.71828182846);
+
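+A minimal sketch of such a wrapper, for exp(), is shown below. The file
+name, sample limit and overall structure are illustrative only; this is
+not the actual profiling library described above.
+
+    #define _GNU_SOURCE
+    #include <dlfcn.h>
+    #include <stdio.h>
+
+    #define TRACE_SAMPLES (4*1024*1024)
+
+    /* Interpose exp(): record the argument, then forward to the real exp(). */
+    double exp(double x)
+    {
+      static double (*real_exp)(double);
+      static FILE *trace;
+      static long long count;
+
+      if (real_exp == 0)
+        real_exp = (double (*)(double)) dlsym(RTLD_NEXT, "exp");
+      if (trace == 0)
+        trace = fopen("/tmp/exp.trace", "wb");   /* illustrative path */
+      if (trace != 0 && count < TRACE_SAMPLES)
+      {
+        fwrite(&x, sizeof(x), 1, trace);         /* raw binary double */
+        count++;
+      }
+      return real_exp(x);
+    }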
diff --git a/testdata/pow.rephil_docs.builtin.baseline.trace b/testdata/pow.rephil_docs.builtin.baseline.trace
new file mode 100644
index 0000000..b7a9722
--- /dev/null
+++ b/testdata/pow.rephil_docs.builtin.baseline.trace
Binary files differ