This Open Access web version of Python for Data Analysis 3rd Edition is now available as a companion to the print and digital editions. If you encounter any errata, please report them here. If you find the online edition of the book useful, please consider ordering a paper copy or a DRM-free eBook to support the author. The content from this website may not be copied or reproduced. The code examples are MIT licensed and can be found on GitHub or Gitee.

This book is concerned with the nuts and bolts of manipulating, processing, cleaning, and crunching data in Python. My goal is to offer a guide to the parts of the Python programming language and its data-oriented library ecosystem and tools that will equip you to become an effective data analyst. While "data analysis" is in the title of the book, the focus is specifically on Python programming, libraries, and tools as opposed to data analysis methodology. This is the Python programming you need for data analysis.

Sometime after I originally published this book in 2012, people started using the term data science as an umbrella description for everything from simple descriptive statistics to more advanced statistical analysis and machine learning. The Python open source ecosystem for doing data analysis (or data science) has also expanded significantly since then. There are now many other books which focus specifically on these more advanced methodologies. My hope is that this book serves as adequate preparation to enable you to move on to a more domain-specific resource.

Some might characterize much of the content of the book as "data manipulation" as opposed to "data analysis." We also use the terms wrangling or munging to refer to data manipulation.

When I say "data," what am I referring to exactly? The primary focus is on structured data, a deliberately vague term that encompasses many different common forms of data, such as:

- Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise). This includes most kinds of data commonly stored in relational databases or tab- or comma-delimited text files.
- Multiple tables of data interrelated by key columns (what would be primary or foreign keys for a SQL user).

Even though it may not always be obvious, a large percentage of datasets can be transformed into a structured form that is more suitable for analysis and modeling. If not, it may be possible to extract features from a dataset into a structured form. As an example, a collection of news articles could be processed into a word frequency table, which could then be used to perform sentiment analysis.

Most users of spreadsheet programs like Microsoft Excel, perhaps the most widely used data analysis tool in the world, will not be strangers to these kinds of data.

1.2 Why Python for Data Analysis?

For many people, the Python programming language has strong appeal. Since its first appearance in 1991, Python has become one of the most popular interpreted programming languages, along with Perl, Ruby, and others. Python and Ruby have become especially popular since 2005 or so for building websites using their numerous web frameworks, like Rails (Ruby) and Django (Python). Such languages are often called scripting languages, as they can be used to quickly write small programs, or scripts to automate other tasks. I don't like the term "scripting languages," as it carries a connotation that they cannot be used for building serious software. Among interpreted languages, for various historical and cultural reasons, Python has developed a large and active scientific computing and data analysis community.
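
To make the news-article example mentioned in the discussion of structured data a little more concrete, here is a minimal sketch of turning free text into a word frequency table. It uses only the Python standard library; the `articles` list and the `word_frequencies` helper are hypothetical names invented for illustration, and the tokenization rule (lowercase runs of letters and apostrophes) is a simplifying assumption rather than a recommended text-processing method.

```python
import re
from collections import Counter

# Hypothetical stand-in for a collection of news articles.
articles = [
    "Markets rally as tech stocks surge",
    "Tech stocks fall as markets react to rate news",
    "Local team wins championship",
]


def word_frequencies(texts):
    """Count how often each lowercased word appears across all texts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenization: runs of letters/apostrophes, case-insensitive.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts


freq = word_frequencies(articles)

# Show the table, most frequent words first.
for word, count in freq.most_common(5):
    print(f"{word:<12}{count}")
```

The resulting mapping of word to count is a simple structured form: it could be loaded into a two-column table and used as input to a later step such as sentiment analysis.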