Data Persistence
1. Role of data in information systems indicating the need for data persistence
What Is an Information System?
An information system (IS) is a set of components that work together to manage data processing and storage. Its role is to support the key aspects of running an organization, such as communication, record-keeping, decision making, data analysis and more. Companies use this information to improve their business operations, make strategic decisions and gain a competitive edge.
All information systems require the input of data in order to perform organizational activities. Data, as described by Stair and Reynolds (2006), is made up of raw facts such as employee information, wages, and hours worked, barcode numbers, tracking numbers or sale numbers. The scope of data collected depends on what information needs to be extrapolated for maximum efficiency.
Data storage is the collective methods and technologies that capture and retain digital information on electromagnetic, optical or silicon-based storage media. Storage is a key component of digital devices, as consumers and businesses have come to rely on it to preserve information ranging from personal photos to business-critical information.
Persistent data is data that remains durable at rest as software and devices come and go. It is typically stable master data that is set once and stays recoverable, whether it is held in flash or in backed memory. Its main characteristics are:
- Non-volatile. Persists in the face of a power outage.
- Data that is set and recoverable, whether it is flash backed or memory backed.
- Data considered durable at rest with the coming and going of hardware and devices. There is a persistence layer at which you hold your data at risk.
- It doesn't change and is not accessed very frequently.
- Master data that’s stable.
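The "non-volatile" property above can be illustrated with a minimal sketch: a value kept only in a variable is lost when the program ends, while a value written to a file can be read back on the next run. The class name `PersistentCounter` and the file name `counter.txt` are assumptions made for this illustration only.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class PersistentCounter {
    // Reads the last saved value, or 0 if nothing has been saved yet.
    static int load(Path file) {
        if (!Files.exists(file)) {
            return 0;
        }
        try {
            String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
            return Integer.parseInt(text.trim());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Writes the value to disk so that it outlives the current process.
    static void save(Path file, int value) {
        try {
            Files.write(file, Integer.toString(value).getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path file = Paths.get("counter.txt"); // hypothetical file name
        int runs = load(file) + 1;            // the previous value survived the last exit
        save(file, runs);
        System.out.println("This program has run " + runs + " time(s).");
    }
}
```

Running the program several times should print an increasing count, because the state lives in the file rather than in memory.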
2. Data, Database, Database Server, and Database Management System
Data
In computing, data is information that has been translated into a form that is efficient for movement or processing. Relative to today's computers and transmission media, data is information converted into binary digital form. It is acceptable for data to be used as a singular subject or a plural subject. Raw data is a term used to describe data in its most basic digital format.
Database
A database is a collection of information that is organized so that it can be easily accessed, managed and updated.
Data is organized into rows, columns and tables, and it is indexed to make it easier to find relevant information. Data gets updated, expanded and deleted as new information is added. Databases process workloads to create and update themselves, querying the data they contain and running applications against it.
Database Server
A database server is a computer system that provides other computers with services related to accessing and retrieving data from a database. Access to the database server may occur via a "front end" running locally on a user's machine (e.g., phpMyAdmin), or a "back end" running on the database server itself, accessed by remote shell. After the information in the database is retrieved, it is output to the user requesting the data.
Database Management System
A database management system (DBMS) is a software package designed to define, manipulate, retrieve and manage data in a database. A DBMS generally manipulates the data itself, the data format, field names, record structure and file structure. It also defines rules to validate and manipulate this data. A DBMS relieves users of framing programs for data maintenance. Fourth-generation query languages, such as SQL, are used along with the DBMS package to interact with a database.
3. Files and Databases, discussing pros and cons of them
Pros of the File System
- Performance can be better than when you do it in a database.
- Saving the files and downloading them in the file system is much simpler than it is in a database.
- Migrating the data is an easy process.
- It's cost effective in most cases to expand your web server rather than pay for certain databases.
- It's easy to migrate it to cloud storage i.e. Amazon S3, CDNs, etc. in the future.
Cons of the File System
- Loosely packed. There are no ACID (Atomicity, Consistency, Isolation, Durability) operations as there are in a relational database, which means there is no guarantee of consistency.
- Low security.
Pros of the Database
- ACID consistency, including the rollback of an update, which is complicated when files are stored outside the database.
- Files will be in sync with the database and cannot be orphaned, which gives you the upper hand in tracking transactions.
- Backups automatically include file binaries.
- It's more secure than saving in a file system.
Cons of the Database
- You may have to convert the files to blobs in order to store them in the database.
- Database backups will be heavier.
- Memory is used inefficiently.
4. Different arrangements of data
Structured data usually resides in relational databases (RDBMS). Fields store length-delineated data such as phone numbers, Social Security numbers, or ZIP codes. Even text strings of variable length, like names, are contained in records, making it a simple matter to search. Data may be human- or machine-generated as long as the data is created within an RDBMS structure. This format is eminently searchable, both with human-generated queries and via algorithms that use the type of data and field names, such as alphabetical or numeric, currency or date.
Unstructured data is essentially everything else. Unstructured data has internal structure but is not structured via pre-defined data models or schemas. It may be textual or non-textual, and human- or machine-generated. It may also be stored within a non-relational database such as a NoSQL database.
Human generated :-
- Text files
- Social Media
- Website
- Mobile data
- Communications
- Media
- Business applications
Machine generated :-
- Satellite images
- Scientific data
- Photographs and video
- Radar or sonar data
Semi-structured data is information that doesn't reside in a relational database but that does have some organizational properties that make it easier to analyze. With some processing you can store it in a relational database (this can be very hard for some kinds of semi-structured data), but the semi-structured form exists to save space, improve clarity or ease computation.
Examples of semi-structured data: CSV, XML and JSON documents are semi-structured documents, and NoSQL databases are also considered semi-structured.
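To make the contrast concrete, here is a small sketch that holds the same customer first as a structured row with fixed, typed fields and then as a semi-structured document whose keys travel with the values. The class and field names (`DataShapes`, `CustomerRow`, `phones`) are invented for this illustration and are not taken from any particular product.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class DataShapes {
    // Structured: every customer has exactly these typed fields, like a table row.
    static final class CustomerRow {
        final int id;
        final String name;
        final String zipCode;

        CustomerRow(int id, String name, String zipCode) {
            this.id = id;
            this.name = name;
            this.zipCode = zipCode;
        }
    }

    // Semi-structured: the record describes itself, and optional or nested
    // fields may be present in one record and absent in the next.
    static Map<String, Object> customerDocument(int id, String name, List<String> phones) {
        Map<String, Object> doc = new LinkedHashMap<>();
        doc.put("id", id);
        doc.put("name", name);
        if (!phones.isEmpty()) {
            doc.put("phones", phones);
        }
        return doc;
    }

    public static void main(String[] args) {
        CustomerRow row = new CustomerRow(1, "Ann", "10115");
        System.out.println(row.id + " | " + row.name + " | " + row.zipCode);
        System.out.println(customerDocument(1, "Ann", Arrays.asList("555-0100", "555-0101")));
        System.out.println(customerDocument(2, "Bob", Collections.<String>emptyList()));
    }
}
```

The two documents printed at the end have different sets of keys, which a fixed table schema would not allow without null columns or an extra table.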
5. Different types of databases
Hierarchical Databases
In a hierarchical database management system (hierarchical DBMS) model, data is stored in nodes with parent-child relationships. In a hierarchical database, besides the actual data, records also contain information about their groups of parent/child relationships.
Network Databases
Network database management systems (network DBMSs) use a network structure to create relationships between entities. Network databases are mainly used on large digital computers. Network databases resemble hierarchical databases, but unlike a hierarchical database, where a node can have only one parent, a network node can have relationships with multiple entities. A network database looks more like a cobweb or an interconnected network of records.
Relational Databases
In relational database management systems (RDBMS), the relationship between data is relational and data is stored in tabular form, in columns and rows. Each column of a table represents an attribute and each row in a table represents a record. Each field in a table represents a data value.
Object-Oriented Model
This model builds on the concepts of object-oriented programming. An object DBMS does more than store programming-language objects: it extends the semantics of languages such as C++ and Java to provide full-featured database programming capability while retaining native language compatibility. It adds database functionality to object programming languages. This approach unifies application and database development into a consistent data model and language environment. Applications require less code, use more natural data modeling, and code bases are easier to maintain. Object developers can write complete database applications with a modest amount of additional effort.
Graph Databases
Graph databases are NoSQL databases that use a graph structure for semantic queries. The data is stored in the form of nodes, edges, and properties. In a graph database, a node represents an entity or instance such as a customer, a person, or a car. A node is equivalent to a record in a relational database system. An edge in a graph database represents a relationship that connects nodes. Properties are additional information added to the nodes.
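The node, edge and property vocabulary can be sketched in a few lines of plain Java. This is only an in-memory illustration of the data model, not the API of any graph database; `TinyGraph`, `connect` and `neighbours` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TinyGraph {
    // A node is an entity (a person, a car) with free-form properties.
    static final class Node {
        final String label;
        final Map<String, Object> properties = new HashMap<>();

        Node(String label) {
            this.label = label;
        }
    }

    // An edge is a typed relationship from one node to another.
    static final class Edge {
        final Node from;
        final String type;
        final Node to;

        Edge(Node from, String type, Node to) {
            this.from = from;
            this.type = type;
            this.to = to;
        }
    }

    private final List<Edge> edges = new ArrayList<>();

    void connect(Node from, String type, Node to) {
        edges.add(new Edge(from, type, to));
    }

    // Follows every edge of the given type that starts at the given node.
    List<Node> neighbours(Node start, String type) {
        List<Node> result = new ArrayList<>();
        for (Edge edge : edges) {
            if (edge.from == start && edge.type.equals(type)) {
                result.add(edge.to);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        TinyGraph graph = new TinyGraph();
        Node alice = new Node("Person");
        alice.properties.put("name", "Alice");
        Node car = new Node("Car");
        car.properties.put("model", "Roadster");
        graph.connect(alice, "OWNS", car);
        for (Node owned : graph.neighbours(alice, "OWNS")) {
            System.out.println("Alice owns a " + owned.properties.get("model"));
        }
    }
}
```

The query "what does Alice own?" is answered by walking edges from a node, where a relational database would join tables.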
Document Databases
Document databases (document DBs) are also NoSQL databases; they store data in the form of documents. Each document represents the data, its relationships with other data elements, and the attributes of the data. Document databases store data in key-value form.
6. Data warehouse with Big data
What is a data warehouse?
Data warehousing is the process of extracting data from one or more homogeneous or heterogeneous data sources, transforming that data, and loading it into a data repository for analysis. The analysis helps in making better decisions to improve performance and can also be used for reporting.
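The extract, transform and load steps in that definition can be sketched as three small functions. The input format ("region,amount" lines), the class name `TinyEtl` and the in-memory map standing in for the repository are all assumptions made for this example; a real warehouse load would target a database.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class TinyEtl {
    // Extract: gather the raw "region,amount" lines from every source.
    static List<String> extract(List<List<String>> sources) {
        List<String> rows = new ArrayList<>();
        for (List<String> source : sources) {
            rows.addAll(source);
        }
        return rows;
    }

    // Transform: split each line, normalise the region name, drop malformed rows.
    static List<String[]> transform(List<String> rows) {
        List<String[]> clean = new ArrayList<>();
        for (String row : rows) {
            String[] parts = row.split(",");
            if (parts.length == 2 && parts[1].trim().matches("\\d+")) {
                clean.add(new String[] { parts[0].trim().toUpperCase(), parts[1].trim() });
            }
        }
        return clean;
    }

    // Load: store one total per region in the repository used for reporting.
    static Map<String, Integer> load(List<String[]> clean) {
        Map<String, Integer> warehouse = new TreeMap<>();
        for (String[] row : clean) {
            warehouse.merge(row[0], Integer.parseInt(row[1]), Integer::sum);
        }
        return warehouse;
    }

    public static void main(String[] args) {
        List<String> shop = Arrays.asList("north,100", "south,40");
        List<String> web = Arrays.asList(" North ,50", "broken row");
        // Prints {NORTH=150, SOUTH=40}
        System.out.println(load(transform(extract(Arrays.asList(shop, web)))));
    }
}
```

The two sources spell the region differently and one row is malformed; the transform step is where such heterogeneity is reconciled before loading.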
What is big data?
Big data refers to the volume, variety, and velocity of data. How big the data is, the speed at which it arrives and the variety of the data determine so-called “Big Data”. The 3 V’s of big data were articulated by industry analyst Doug Laney in the early 2000s.
Differences
- A data warehouse is an architecture for storing data, i.e. a data repository, whereas Big Data is a technology for handling huge volumes of data and preparing the repository.
- A data warehouse accepts any kind of DBMS data, whereas Big Data accepts all kinds of data, including transactional data, social media data, machine data or any DBMS data.
- A data warehouse only handles structured data (relational or non-relational), but Big Data can handle structured, unstructured and semi-structured data.
- Big Data normally uses a distributed file system to load huge data sets in a distributed way, but a data warehouse has no such concept.
7. Application components communicate with files and databases
- File – file path, URL
- Using a file path or URL, the application can access a particular resource and add to or modify it.
- DB – connection string
- The application first has to establish a connection to the database using a connection string. Once the connection between the database and the application is established, the application can perform any supported operation on the data in the database.
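A short sketch of both access styles in Java: a file is located by its path alone, while a database is located by a connection string that names the driver, host, port and database. The URL shape below follows the common `jdbc:mysql://host:port/database` form; the host, port and database name used in `main` are placeholders, and `open` only works when a matching JDBC driver is on the classpath.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.util.List;

class ResourceAccess {
    // File: the path is all the application needs to reach the resource.
    static List<String> readLines(String filePath) {
        try {
            return Files.readAllLines(Paths.get(filePath));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Database: the connection string identifies the driver, server and database.
    static String jdbcUrl(String host, int port, String database) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database;
    }

    // The connection must be established before any statement can be executed.
    // This call fails with an SQLException if no suitable driver is available.
    static Connection open(String url, String user, String password) throws SQLException {
        return DriverManager.getConnection(url, user, password);
    }

    public static void main(String[] args) {
        // Placeholder values; replace them with those of a real server.
        System.out.println(jdbcUrl("localhost", 3306, "school"));
    }
}
```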
8. SQL statements, Prepared statements, and Callable statements
SQL Statements
Execute standard SQL statements from the application
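Statement stmt = con.createStatement();
// The values are concatenated into the SQL text, so NAME needs its own quotes.
// Concatenating user input like this is open to SQL injection.
stmt.executeUpdate("update STUDENT set NAME = '" + name +
    "' where ID = " + id);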
Prepared statements
The query only needs to be parsed (or prepared) once, but can be executed multiple times with the same or different parameters.
PreparedStatement pstmt = con.prepareStatement(
    "update STUDENT set NAME = ? where ID = ?");
pstmt.setString(1, "MyName");
pstmt.setInt(2, 111);
pstmt.executeUpdate();
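Besides being parsed only once, a prepared statement keeps parameter values separate from the SQL text. The following sketch shows what can go wrong when a value is concatenated into the statement instead; it only builds strings and never touches a database, and `QueryBuilding` is a name invented for the example.

```java
class QueryBuilding {
    // Mirrors the plain Statement approach: the value is pasted into the SQL text.
    static String concatenated(String name, int id) {
        return "update STUDENT set NAME = '" + name + "' where ID = " + id;
    }

    public static void main(String[] args) {
        // A normal value produces the intended statement.
        System.out.println(concatenated("MyName", 111));
        // A crafted value closes the quote early and turns the rest into a comment.
        System.out.println(concatenated("x' -- ", 111));
    }
}
```

The second line printed is `update STUDENT set NAME = 'x' -- ' where ID = 111`: the WHERE clause now sits behind a comment marker, so the update would apply to every row. With `pstmt.setString(1, ...)` the same input is sent as data and cannot change the structure of the statement.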
Callable statements
Execute stored procedures
CallableStatement cstmt = con.prepareCall("{call anyProcedure(?, ?, ?)}");
// Bind the three parameters with cstmt.setXxx(...) before executing.
cstmt.execute();
9. Need for ORM, explaining the development with and without ORM
What is ORM?
Object-relational mapping (ORM) is a mechanism that makes it possible to address, access and manipulate objects without having to consider how those objects relate to their data sources. ORM lets programmers maintain a consistent view of objects over time, even as the sources that deliver them, the sinks that receive them and the applications that access them change.
PROS
- Facilitates implementing domain model pattern.
- Huge reduction in code.
- Takes care of vendor specific code by itself.
- Cache Management — Entities are cached in memory thereby reducing load on the DB.
CONS
- Increased startup time due to metadata preparation (not good for desktop applications).
- Steep learning curve.
- Relatively hard to fine tune and debug generated SQL.
- Not suitable for applications without a clean domain object model.
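The "huge reduction in code" point can be seen in a toy comparison. Without an ORM, each class needs hand-written code that copies columns into fields; an ORM replaces that with one generic routine driven by metadata. The sketch below uses a `Map` as a stand-in for a database row and plain reflection as a stand-in for the ORM; real tools such as those listed later do far more (SQL generation, caching, relationships), and every name here is hypothetical.

```java
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

class MappingSketch {
    static class Student {
        int id;
        String name;
    }

    // Without ORM: column-to-field code is written by hand for every class.
    static Student mapByHand(Map<String, Object> row) {
        Student student = new Student();
        student.id = (Integer) row.get("id");
        student.name = (String) row.get("name");
        return student;
    }

    // ORM style: one generic routine matches column names to field names.
    static <T> T mapByReflection(Map<String, Object> row, Class<T> type) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Field field : type.getDeclaredFields()) {
                if (row.containsKey(field.getName())) {
                    field.setAccessible(true);
                    field.set(target, row.get(field.getName()));
                }
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Cannot map row to " + type.getName(), e);
        }
    }

    public static void main(String[] args) {
        Map<String, Object> row = new HashMap<>();
        row.put("id", 111);
        row.put("name", "MyName");
        Student byHand = mapByHand(row);
        Student generic = mapByReflection(row, Student.class);
        System.out.println(byHand.id + " " + byHand.name);
        System.out.println(generic.id + " " + generic.name);
    }
}
```

The reflective version also hints at the cons listed above: the mapping metadata has to be prepared at startup, and errors surface at run time rather than at compile time.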
10. POJO, Java Beans, and JPA
POJO
- It doesn’t have special restrictions other than those forced by the Java language.
- It doesn’t provide much control on members.
- It can implement Serializable interface.
- Fields can be accessed by their names.
- Fields can have any visibility.
- There can be a no-arg constructor.
- It is used when you don’t want to place restrictions on your members and want to give the user complete access to your entity.
JAVA BEAN
- It is a special POJO that has some restrictions.
- It provides complete control over members.
- It should implement the Serializable interface.
- Fields are accessed only through getters and setters.
- Fields have only private visibility.
- It must have a no-arg constructor.
- It is used when you want to give the user access to only some parts of your entity.
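The two lists above translate into code roughly as follows. `PointPojo` and `PersonBean` are example names; the validation in `setAge` is an assumption added to show what "complete control over members" can mean in practice.

```java
import java.io.Serializable;

class PojoVsBean {
    // POJO: no rules beyond the language itself; fields may be public
    // and are accessed directly by name.
    static class PointPojo {
        public int x;
        public int y;
    }

    // JavaBean: Serializable, a public no-arg constructor, and private
    // fields that are reachable only through getters and setters.
    static class PersonBean implements Serializable {
        private static final long serialVersionUID = 1L;

        private String name;
        private int age;

        public PersonBean() {
        }

        public String getName() {
            return name;
        }

        public void setName(String name) {
            this.name = name;
        }

        public int getAge() {
            return age;
        }

        // The setter is the single place where a rule on the member is enforced.
        public void setAge(int age) {
            if (age < 0) {
                throw new IllegalArgumentException("age must not be negative");
            }
            this.age = age;
        }
    }

    public static void main(String[] args) {
        PointPojo point = new PointPojo();
        point.x = 3; // direct field access
        PersonBean person = new PersonBean();
        person.setName("Ann"); // access through the setter
        person.setAge(30);
        System.out.println(point.x + " " + person.getName() + " " + person.getAge());
    }
}
```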
JPA
A JPA (Java Persistence API) entity is a Java object with the following characteristics:
- it is EJB 3.0-compliant;
- it is light-weight;
- it manages persistent data in concert with a JPA entity manager;
- it performs complex business logic;
- it potentially uses several dependent Java objects;
- it can be uniquely identified by a primary key.
11. ORM tools available for different development platforms (Java, PHP, and .Net)
- C++ :- ODB, QxOrm
- Java :- ActiveJDBC, ActiveJPA, Apache Cayenne, Apache Gora, Athena Framework, Carbonado
- .NET :- Base One Foundation Component Library, DatabaseObjects, DataObjects.NET, Dapper, ECO, Entity Framework
- PHP :- CakePHP, CodeIgniter, Doctrine, FuelPHP
- Python :- Django, SQLAlchemy, SQLObject, Storm
12. Need for NoSQL
NoSQL favors scalability and availability over atomicity or consistency. According to the CAP theorem (Consistency, Availability and tolerance to network Partitions) for shared-data systems, only two of the three can be achieved at any time.
The NoSQL approach to storing and querying data offers several advantages:
- Schemaless data representation
- Development time
- Speed
- Plan ahead for scalability
Examples of NoSQL databases:
- MongoDB
- Redis
- Couch DB
- RavenDB
- MemcacheDB
- Riak
- Neo4j
13. What Hadoop is, explaining the core concepts of it
What is Hadoop?
Hadoop is an open source project for distributed computing. It is based on the concepts of the Google File System and MapReduce. The core concept of Hadoop is to divide data into chunks and distribute them across all cluster nodes. When a computation is executed on the cluster, each node processes the data chunk that resides on that node.
- Hadoop Architecture distributes data across the cluster nodes by splitting it into small blocks.
- Whenever Hadoop processes data, each node communicates with other nodes as little as possible. This concept is known as “Data Locality”.
- Since the data is distributed across the nodes, each block of data is replicated (as per the configuration) over different nodes in order to increase data availability.
- Whenever a MapReduce job (typically consisting of two kinds of task, Map tasks and Reduce tasks) is executed, Map tasks run on individual data blocks on each node (in most cases) and leverage “Data Locality”. This is how multiple nodes process data in parallel.
- If any node fails in between, the master will detect this failure and assign the same task to another node where the replica of the same data block is available.
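The Map and Reduce split described above can be imitated on a single machine. In this sketch each string plays the role of a data block, `mapTask` counts words within one block, and `reduceTask` merges the partial results. It uses only the Java standard library, not the Hadoop API, and a parallel stream merely stands in for cluster nodes working on their own blocks at the same time; all names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

class WordCountSketch {
    // Map task: runs on one block and emits a partial count per word.
    static Map<String, Integer> mapTask(String block) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String word : block.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Reduce task: merges the partial counts produced by all map tasks.
    static Map<String, Integer> reduceTask(List<Map<String, Integer>> partials) {
        Map<String, Integer> total = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, count) -> total.merge(word, count, Integer::sum));
        }
        return total;
    }

    static Map<String, Integer> run(List<String> blocks) {
        // Each block is processed independently, as it would be on its own node.
        List<Map<String, Integer>> partials = blocks.parallelStream()
                .map(WordCountSketch::mapTask)
                .collect(Collectors.toList());
        return reduceTask(partials);
    }

    public static void main(String[] args) {
        // Prints {big=2, data=2, node=1}
        System.out.println(run(Arrays.asList("big data big", "data node")));
    }
}
```

Because a map task needs nothing but its own block, it can be rerun on another copy of that block if a node fails, which is what makes the replication described above useful.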
14. Concept of IR, identifying tools for IR
Information retrieval, as the name implies, concerns the retrieving of relevant information from databases. It is basically concerned with facilitating the user's access to large amounts of (predominantly textual) information. The process of information retrieval involves the following stages:
- Representing collections of documents - how to represent, identify and process the collection of documents.
- User-initiated querying - understanding and processing of the queries.
- Retrieval of the appropriate documents - the searching mechanism used to obtain and retrieve the relevant documents
Tools for IR:
- Apache Solr
- Elasticsearch
- Algolia
- Sphinx (search engine)
- Site Search 360
- OpenSearchServer
- Xapian
- Manticore search
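The three stages of information retrieval (representing the collection, processing the query, retrieving matching documents) can be sketched with a tiny inverted index, the basic data structure behind search engines of this kind. This is an illustration only and does not use the API of any of the tools listed; `TinySearch` and its methods are hypothetical, and a query here matches documents that contain every query term.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class TinySearch {
    // term -> ids of the documents that contain it
    private final Map<String, Set<Integer>> index = new HashMap<>();

    // Stage 1: represent the collection by indexing every term of a document.
    void add(int docId, String text) {
        for (String term : tokens(text)) {
            index.computeIfAbsent(term, key -> new TreeSet<>()).add(docId);
        }
    }

    // Stages 2 and 3: process the query, then retrieve the documents
    // that contain all of its terms.
    Set<Integer> search(String query) {
        Set<Integer> result = null;
        for (String term : tokens(query)) {
            Set<Integer> hits = index.getOrDefault(term, new TreeSet<Integer>());
            if (result == null) {
                result = new TreeSet<>(hits);
            } else {
                result.retainAll(hits);
            }
        }
        return result == null ? new TreeSet<Integer>() : result;
    }

    // Lower-cases the text and splits it on anything that is not a letter or digit.
    private static List<String> tokens(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!term.isEmpty()) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        TinySearch search = new TinySearch();
        search.add(1, "Data persistence with relational databases");
        search.add(2, "Big data and Hadoop");
        System.out.println(search.search("data"));
        System.out.println(search.search("data persistence"));
    }
}
```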
Thank you for reading to the end of the post. See you in another blog post.